Block-Coordinate Frank-Wolfe Optimization for Structural SVMs

(a) OCR dataset, λ = 0.01. (b) OCR dataset, λ = 0.001. (c) OCR dataset, λ = 1/n. (d) CoNLL dataset, λ = 1/n. (e) Test error for λ = 1/n on CoNLL. (f) Matching dataset, λ = 0.001.

Figure 1. The shaded areas for the stochastic methods (BCFW, SSG and online-EG) indicate the worst and best objective achieved in 10 randomized runs. The top row compares the suboptimality achieved by different solvers for different regularization parameters λ. For large λ (a), the stochastic algorithms (BCFW and SSG) perform considerably better than the batch solvers (cutting plane and FW). For a small λ (c), even the batch solvers achieve a lower objective earlier on than SSG. Our proposed BCFW algorithm achieves a low objective in both settings. (d) shows the convergence for CoNLL with the first passes in more detail. Here BCFW already results in a low objective even after seeing only a few data points. The advantage is less clear for the test error in (e) though, where SSG-wavg does surprisingly well. Finally, (f) compares the methods for the matching prediction task.

...in the caption, while additional experiments can be found in Appendix F. In most of the experiments, the BCFW-wavg method dominates all competitors. The superiority is especially striking for the first few iterations, and when using a small regularization strength λ, which is often needed in practice. In terms of test error, a peculiar observation is that the weighted average of the iterates seems to help both methods significantly: SSG-wavg sometimes slightly outperforms BCFW-wavg despite having the worst objective value amongst all methods. This phenomenon is worth further investigation.

7. Related Work

There has been substantial work on dual coordinate descent for SVMs, including the original sequential minimal optimization (SMO) algorithm. The SMO algorithm was generalized to structural SVMs (Taskar, 2004, Chapter 6), but its convergence rate scales badly with the size of the output space: it was estimated as O(n|Y|/ε) in Zhang et al. (2011). Further, this method requires an expectation oracle to work with its factored dual parameterization. As in our algorithm, Rousu et al. (2006) propose updating one training example at a time, but using multiple Frank-Wolfe updates to optimize along the subspace. However, they do not obtain any rate guarantees, and their algorithm is less general because it again requires an expectation oracle. In the degenerate binary SVM case, our block-coordinate Frank-Wolfe algorithm is actually equivalent to the method of Hsieh et al. (2008), where, because each datapoint has a unique dual variable, exact coordinate optimization can be accomplished by the line-search step of our algorithm. Hsieh et al. (2008) show a local linear convergence rate in the dual, and our results complement theirs by providing a global primal convergence guarantee for their algorithm of O(1/ε). After our paper had appeared on arXiv, Shalev-Shwartz & Zhang (2012) have proposed a generalization of dual coordinate descent applicable to several regularized losses, including the structural SVM objective. Despite being motivated from a different perspective, a version of their algorithm (Option II of Figure 1) gives the exact same step-size and update direction as BCFW with line-search, and their Corollary 3 gives a similar convergence rate as our Theorem 3. Balamurugan et al. (2011) propose to approximately solve a quadratic problem on each example using SMO, but they do not provide any rate guarantees. The online-EG method implements a variant of dual coordinate descent, but it requires an expectation oracle, and Collins et al. (2008) estimate its primal convergence at only O(1/ε²).

Besides coordinate descent methods, a variety of other algorithms have been proposed for structural SVMs. We summarize a few of the most popular in Table 1, with their convergence rates quoted in number of oracle calls to reach an accuracy of ε. However, we note that almost no guarantees are given for the optimization of structural SVMs with approximate oracles. A regret analysis in the context of online optimization was considered by Ratliff et al. (2007), but they do not analyze the effect of this on solving the optimization problem. The cutting plane algorithm of Tsochantaridis et al. (2005) was considered with approximate maximization by Finley & Joachims (2008), though the dependence of the running time on the approximation error was left unclear. In contrast, we provide guarantees for batch subgradient, cutting plane, and block-coordinate Frank-Wolfe, for achieving an ε-approximate solution as long as the error of the oracle is appropriately bounded.

Table 1. Convergence rates, given in the number of calls to the oracles, for different optimization algorithms for the structural SVM objective (1) in the case of a Markov random field structure, to reach a specific accuracy ε measured for different types of gaps, in terms of the number of training examples n, the regularization parameter λ, the size of the label space |Y|, and the maximum feature norm R := max_{i,y} ||ψ_i(y)||_2 (some minor terms were ignored for succinctness). Table inspired from (Zhang et al., 2011). Notice that only stochastic subgradient and our proposed algorithm have rates independent of n.

  Optimization algorithm                                 Online  Primal/Dual    Type of guarantee     Oracle type         # Oracle calls
  dual extragradient (Taskar et al., 2006)               no      primal-'dual'  saddle point gap      Bregman projection  O(nR log|Y| / ε)
  online exponentiated gradient (Collins et al., 2008)   yes     dual           expected dual error   expectation         O((n + log|Y|) R² / (λε))
  excessive gap reduction (Zhang et al., 2011)           no      primal-dual    duality gap           expectation         O(nR sqrt(log|Y| / (λε)))
  BMRM (Teo et al., 2010)                                no      primal         primal error          maximization        O(nR² / (λε))
  1-slack SVM-Struct (Joachims et al., 2009)             no      primal-dual    duality gap           maximization        O(nR² / (λε))
  stochastic subgradient (Shalev-Shwartz et al., 2010a)  yes     primal         primal error w.h.p.   maximization        Õ(R² / (λε))
  this paper: block-coordinate Frank-Wolfe               yes     primal-dual    expected duality gap  maximization        O(R² / (λε))  [Thm. 3]

8. Discussion

This work proposes a novel randomized block-coordinate generalization of the classic Frank-Wolfe algorithm for optimization with block-separable constraints. Despite its potentially much lower iteration cost, the new algorithm achieves a similar convergence rate in the duality gap as the full Frank-Wolfe method. For the dual structural SVM optimization problem, it leads to a simple online algorithm that yields a solution to an issue that is notoriously difficult to address for stochastic algorithms: no step-size sequence needs to be tuned, since the optimal step-size can be efficiently computed in closed form. Further, at the cost of an additional pass through the data (which could be done alongside a full Frank-Wolfe iteration), it allows us to compute a duality gap guarantee that can be used to decide when to terminate the algorithm. Our experiments indicate that empirically it converges faster than other stochastic algorithms for the structural SVM problem, especially in the realistic setting where only a few passes through the data are possible. Although our structural SVM experiments use an exact maximization oracle, the duality gap guarantees, the optimal step-size, and a computable bound on the duality gap are all still available when only an appropriate approximate maximization oracle is used. Finally, although the structural SVM problem is what motivated this work, we expect that the block-coordinate Frank-Wolfe algorithm may be useful for other problems in machine learning where a complex objective with block-separable constraints arises.

Acknowledgements. We thank Francis Bach, Bernd Gärtner and Ronny Luss for helpful discussions, and Robert Carnecky for the 3D illustration. MJ is supported by the ERC Project SIPA and by the Swiss National Science Foundation. SLJ and MS are partly supported by the ERC (SIERRA-ERC-239993). SLJ is supported by a Research in Paris fellowship. MS is supported by a NSERC postdoctoral fellowship.
References

Bach, F., Lacoste-Julien, S., and Obozinski, G. On the equivalence between herding and conditional gradient algorithms. In ICML, 2012.

Balamurugan, P., Shevade, S., Sundararajan, S., and Keerthi, S. A sequential dual method for structural SVMs. In SDM, 2011.

Caetano, T. S., McAuley, J. J., Cheng, L., Le, Q. V., and Smola, A. J. Learning graph matching. IEEE PAMI, 31(6):1048-1058, 2009.

Clarkson, K. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 6(4):1-30, 2010.

Collins, M., Globerson, A., Koo, T., Carreras, X., and Bartlett, P. L. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. JMLR, 9:1775-1822, 2008.

Dunn, J. C. and Harshbarger, S. Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62(2):432-444, 1978.

Finley, T. and Joachims, T. Training structural SVMs when exact inference is intractable. In ICML, 2008.

Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95-110, 1956.

Gärtner, B. and Jaggi, M. Coresets for polytope distance. ACM Symposium on Computational Geometry, 2009.

Hsieh, C., Chang, K., Lin, C., Keerthi, S., and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In ICML, pp. 408-415, 2008.

Jaggi, M. Sparse convex optimization methods for machine learning. PhD thesis, ETH Zürich, 2011.

Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.

Joachims, T., Finley, T., and Yu, C. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27-59, 2009.

Lacoste-Julien, S., Schmidt, M., and Bach, F. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. Technical Report 1212.2002v2 [cs.LG], arXiv, December 2012.

Mangasarian, O. L. Machine learning via polyhedral concave minimization. Technical Report 95-20, University of Wisconsin, 1995.

Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.

Ouyang, H. and Gray, A. Fast stochastic Frank-Wolfe algorithms for nonlinear SVMs. SDM, 2010.

Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.

Ratliff, N., Bagnell, J. A., and Zinkevich, M. (Online) subgradient methods for structured prediction. In AISTATS, 2007.

Rousu, J., Saunders, C., Szedmak, S., and Shawe-Taylor, J. Kernel-based learning of hierarchical multilabel classification models. JMLR, 2006.

Sang, E. F. T. K. and Buchholz, S. Introduction to the CoNLL-2000 shared task: Chunking, 2000.

Shalev-Shwartz, S. and Zhang, T. Proximal stochastic dual coordinate ascent. Technical Report 1211.2717v1 [stat.ML], arXiv, November 2012.

Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1), 2010a.

Shalev-Shwartz, S., Srebro, N., and Zhang, T. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20:2807-2832, 2010b.

Shamir, O. and Zhang, T. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML, 2013.

Taskar, B. Learning structured prediction models: A large margin approach. PhD thesis, Stanford, 2004.

Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. In NIPS, 2003.

Taskar, B., Lacoste-Julien, S., and Jordan, M. I. Structured prediction, dual extragradient and Bregman projections. JMLR, 7:1627-1653, 2006.

Teo, C. H., Vishwanathan, S. V. N., Smola, A. J., and Le, Q. V. Bundle methods for regularized risk minimization. JMLR, 11:311-365, 2010.

Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. JMLR, 6:1453-1484, 2005.

Zhang, X., Saha, A., and Vishwanathan, S. V. N. Accelerated training of max-margin Markov networks with kernels. In ALT, pp. 292-307. Springer, 2011.
Supplementary Material for
Block-Coordinate Frank-Wolfe Optimization for Structural SVMs

Outline. In Appendix A, we discuss the curvature constants and compute them for the structural SVM problem. In Appendix B, we give additional details on applying the Frank-Wolfe algorithms to the structural SVM and provide proofs for Theorems 1 and 3. In the main Appendix C, we give a self-contained presentation and analysis of the new block-coordinate Frank-Wolfe method (Algorithm 3), and prove the main convergence Theorem 2. In Appendix D, the 'linearization' duality gap is interpreted in terms of Fenchel duality. For completeness, we include a short derivation of the dual problem to the structural SVM in Appendix E. Finally, we present in Appendix F additional experimental results as well as more detailed information about the implementation.

A. The Curvature Constants C_f and C_f^⊗

The Curvature Constant C_f. The curvature constant C_f is given by the maximum relative deviation of the objective function f from its linear approximations, over the domain M (Clarkson, 2010; Jaggi, 2013). Formally,

    C_f := sup_{x,s ∈ M, γ ∈ [0,1], y = x + γ(s−x)}  (2/γ²) (f(y) − f(x) − ⟨y − x, ∇f(x)⟩).   (7)

The assumption of bounded C_f corresponds to a slightly weaker, affine invariant form of a smoothness assumption on f. It is known that C_f is upper bounded by the Lipschitz constant of the gradient ∇f times the squared diameter of M, for any arbitrary choice of a norm (Jaggi, 2013, Lemma 8); but it can also be much smaller (in particular, when the dimension of the affine hull of M is smaller than the ambient space), so it is a more fundamental quantity in the analysis of the Frank-Wolfe algorithm than the Lipschitz constant of the gradient. As pointed out by Jaggi (2013, Section 2.4), C_f is invariant under affine transformations, as is the Frank-Wolfe algorithm.

The Product Curvature Constant C_f^⊗. The curvature concept can be generalized to our setting of product domains M := M^(1) × ... × M^(n) as follows: over each individual coordinate block, the curvature is given by

    C_f^(i) := sup_{x ∈ M, s_(i) ∈ M^(i), γ ∈ [0,1], y = x + γ(s_[i] − x_[i])}  (2/γ²) (f(y) − f(x) − ⟨y_(i) − x_(i), ∇_(i) f(x)⟩),   (8)

where the notation x_[i] refers to the zero-padding of x_(i) so that x_[i] ∈ R^m. By considering the Taylor expansion of f, it is not hard to see that also the 'partial' curvature C_f^(i) is upper bounded by the Lipschitz constant of the partial gradient ∇_(i) f times the squared diameter of just one domain block M^(i). See also the proof of Lemma A.2 below. We define the global product curvature constant as the sum of these curvatures for each block, i.e.

    C_f^⊗ := Σ_{i=1}^n C_f^(i).   (9)

Observe that for the classical Frank-Wolfe case when n = 1, we recover the original curvature constant.

...that in problem (4) (in the maximization version). This difference is

    g_Lagrange(w, α) = λ/2 wᵀw + (1/n) Σ_{i=1}^n H̃_i(w) − (bᵀα − λ/2 wᵀw)
                     = λ wᵀw − bᵀα + (1/n) Σ_{i=1}^n max_{y ∈ Y_i} H_i(y; w).

Now recall that, by the definition of A and b, we have that (1/n) H_i(y; w) = (b − λAᵀw)_(i,y) = −(∇f(α))_(i,y). By summing up over all points and re-using a similar argument as in Lemma B.1 above, we get that

    (1/n) Σ_{i=1}^n max_{y ∈ Y_i} H_i(y; w) = Σ_{i=1}^n max_{y ∈ Y_i} (−∇f(α))_(i,y) = max_{s' ∈ M} ⟨s', −∇f(α)⟩,

and hence

    g_Lagrange(w, α) = (λAᵀw − b)ᵀα + (1/n) Σ_{i=1}^n max_{y ∈ Y_i} H_i(y; w)
                     = ⟨∇f(α), α⟩ + max_{s' ∈ M} ⟨s', −∇f(α)⟩ = ⟨α − s, ∇f(α)⟩ = g(α),

as defined in (5).
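To make the role of the curvature constant concrete, here is a small numerical sketch (illustrative only, not from the paper; the particular random quadratic and the simplex domain are our assumptions). For a quadratic f(x) = ½ xᵀHx with H = AᵀA, the quantity inside the sup of (7) reduces to (s−x)ᵀH(s−x), which is jointly convex in (x, s), so over a simplex the sup is attained at vertex pairs and C_f can be computed exactly and compared against the Lipschitz-times-squared-diameter upper bound mentioned above.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((3, d))  # a low-rank-ish linear map, as in the SVM dual
H = A.T @ A                      # Hessian of f(x) = 0.5 * x^T H x
V = np.eye(d)                    # vertices of the probability simplex in R^d

# For quadratic f, (2/g^2)(f(y) - f(x) - <y-x, grad f(x)>) = (s-x)^T H (s-x),
# a convex function of (x, s), so the sup over the simplex is at vertex pairs.
C_f = max((V[i] - V[j]) @ H @ (V[i] - V[j]) for i, j in product(range(d), repeat=2))

L_grad = np.linalg.eigvalsh(H).max()  # Lipschitz constant of the gradient
diam2 = max(np.sum((V[i] - V[j]) ** 2) for i, j in product(range(d), repeat=2))
bound = L_grad * diam2                # the generic upper bound on C_f
```

Since the rows of A span only a 3-dimensional space here, C_f is typically noticeably smaller than the generic bound, illustrating the remark above.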
B.3. Convergence Analysis

B.3.1. Convergence of the Batch Frank-Wolfe Algorithm 2 on the Structural SVM Dual

Theorem 1. Algorithm 2 obtains an ε-approximate solution to the structural SVM dual problem (4), with duality gap g(α^(k)) ≤ ε, after at most O(R²/(λε)) iterations, where each iteration costs n oracle calls.

Proof. We apply the known convergence results for the standard Frank-Wolfe Algorithm 1, as given e.g. in (Frank & Wolfe, 1956; Dunn & Harshbarger, 1978; Jaggi, 2013), or as given in the paragraph just after the proof of Theorem C.1: For each k ≥ 1, the iterate α^(k) of Algorithm 1 (either using the predefined step-sizes, or using line-search) satisfies

    E[f(α^(k))] − f(α*) ≤ 2C_f / (k + 2),

where α* ∈ M is an optimal solution to problem (4). Furthermore, if Algorithm 1 is run for K ≥ 1 iterations, then it has an iterate α^(k̂), 1 ≤ k̂ ≤ K, with duality gap bounded by E[g(α^(k̂))] ≤ 6C_f / (K + 1). This was shown e.g. in (Jaggi, 2013) with slightly different constants, or also in our analysis presented below (see the paragraph after the generalized analysis provided in Theorem C.3, when the number of blocks n is set to one). Now for the SVM problem and the equivalent Algorithm 2, the claim follows from the curvature bound C_f ≤ 4R²/λ for the dual structural SVM objective function (4) over the domain M := Δ_|Y_1| × ... × Δ_|Y_n|, as given in the above Lemma A.1.
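As an illustration of the batch method that Theorem 1 builds on, the following sketch (our illustrative code, not the paper's) runs standard Frank-Wolfe over a probability simplex on a toy quadratic, tracking the 'linearization' duality gap g(x) = ⟨x − s, ∇f(x)⟩ that serves as the stopping certificate; the linear minimization oracle on a simplex just picks the coordinate with the smallest gradient entry.

```python
import numpy as np

def frank_wolfe_simplex(grad_f, x0, K=300):
    """Standard Frank-Wolfe over the probability simplex with predefined
    step sizes 2/(k+2), returning the iterate and the duality-gap history."""
    x = x0.copy()
    gaps = []
    for k in range(K):
        g = grad_f(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0            # linear minimization oracle (a vertex)
        gaps.append(float((x - s) @ g))  # duality gap >= f(x) - f(x*)
        x = x + 2.0 / (k + 2) * (s - x)  # convex step keeps x in the simplex
    return x, gaps

# toy quadratic objective f(x) = 0.5 * ||A x - c||^2 (an illustrative assumption)
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 8))
c = rng.standard_normal(5)
grad = lambda x: A.T @ (A @ x - c)
x0 = np.ones(8) / 8
x, gaps = frank_wolfe_simplex(grad, x0)
```

The recorded gaps are always nonnegative (the oracle is exact) and the best gap seen shrinks roughly like O(1/k), in line with the 6C_f/(K+1) guarantee quoted above.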
B.3.2. Convergence of the Block-Coordinate Frank-Wolfe Algorithm 4 on the Structural SVM Dual

Theorem 3. If L_max ≤ 4R²/(λn) (so that h_0 ≤ 4R²/(λn)), then Algorithm 4 obtains an ε-approximate solution to the structural SVM dual problem (4), with expected duality gap E[g(α^(k))] ≤ ε, after at most O(R²/(λε)) iterations, where each iteration costs a single oracle call. If L_max > 4R²/(λn), then it requires at most an additional (constant in ε) number of O(n log(λnL_max/R²)) steps to get the same error and duality gap guarantees, whereas the predefined step-size variant will require an additional O(nL_max/ε) steps.

Proof. Writing h_0 = f(α^(0)) − f(α*) for the error at the starting point used by the algorithm, the convergence Theorem 2 states that if k ≥ 0 and k ≥ (2n/ε)(C_f^⊗ + h_0), then the expected error is E[f(α^(k))] − f(α*) ≤ ε and ...

... gap component g^(i)(α) as defined in (16) — the total duality gap is then g(α) = Σ_{i=1}^n g^(i)(α), which can only be computed if we do a batch pass over all the data points, as explained in the previous paragraph.

B.5. More Details on the Kernelized Algorithm

Both Algorithms 2 and 4 can be used with kernels by explicitly maintaining the sparse dual variables α^(k) instead of the primal variables w^(k). In this case, the classifier is only given implicitly as a sparse combination of the corresponding kernel functions, i.e. w = Aα, where ψ_i(y) = k(x_i, y_i; ·, ·) − k(x_i, y; ·, ·) for a structured kernel k : (X × Y) × (X × Y) → R. Note that the number of non-zero dual variables is upper-bounded by the number of iterations, and so the time to take dot products grows quadratically in the number of iterations.
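The block update analyzed above can be sketched in the degenerate multiclass case, where each Y_i is small enough to enumerate, so the loss-augmented decoding oracle is exact. This is our illustrative reimplementation, not the authors' code: the names `bcfw_multiclass` and `psi`, the toy feature map e_y ⊗ x, and the 0-1 loss are all assumptions; the step size below is simply the exact minimizer of the quadratic dual f(α) = λ/2 ||Aα||² − bᵀα along the chosen Frank-Wolfe direction, clipped to [0, 1].

```python
import numpy as np

def psi(x, y, n_cls):
    """Joint feature map e_y (x) x: x copied into the block of class y."""
    out = np.zeros(n_cls * x.size)
    out[y * x.size:(y + 1) * x.size] = x
    return out

def bcfw_multiclass(X, Y, n_cls, lam=0.1, n_iter=2000, seed=0):
    """Block-coordinate Frank-Wolfe with exact line search (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, ell = np.zeros(n_cls * d), 0.0               # w = A alpha, ell = b^T alpha
    W, L = np.zeros((n, n_cls * d)), np.zeros(n)    # per-block w_i, ell_i
    for _ in range(n_iter):
        i = int(rng.integers(n))
        # exact loss-augmented decoding: argmax_y  L_i(y) + <w, psi(x_i, y)>
        scores = [(y != Y[i]) + w @ psi(X[i], y, n_cls) for y in range(n_cls)]
        ystar = int(np.argmax(scores))
        w_s = (psi(X[i], Y[i], n_cls) - psi(X[i], ystar, n_cls)) / (lam * n)
        l_s = float(ystar != Y[i]) / n
        diff = W[i] - w_s                           # Frank-Wolfe direction, block i
        denom = lam * (diff @ diff)
        # exact minimizer of the quadratic dual along the segment, clipped to [0,1]
        gamma = 1.0 if denom < 1e-12 else float(
            np.clip((lam * (diff @ w) - L[i] + l_s) / denom, 0.0, 1.0))
        w_i_new = (1 - gamma) * W[i] + gamma * w_s
        l_i_new = (1 - gamma) * L[i] + gamma * l_s
        w, ell = w + (w_i_new - W[i]), ell + (l_i_new - L[i])
        W[i], L[i] = w_i_new, l_i_new
    return w
```

On a toy separable three-class dataset, a few hundred passes of these O(1)-example updates suffice to fit the training set; prediction is argmax_y ⟨w, psi(x, y)⟩.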
Algorithm B.1: Kernelized Dual Block-Coordinate Frank-Wolfe for Structural SVM

    Let α^(0) := (e^{y_1}, ..., e^{y_n}) ∈ M = Δ_|Y_1| × ... × Δ_|Y_n|, and ᾱ^(0) := α^(0)
    for k = 0 ... K do
        Pick i uniformly at random in {1, ..., n}
        Solve y_i* := argmax_{y ∈ Y_i} H_i(y; Aα^(k))      (solve the loss-augmented decoding problem (2))
        s_(i) := e^{y_i*} ∈ M^(i)                           (having only a single non-zero entry)
        Let γ := 2n/(k + 2n), or optimize γ by line-search
        Update α^(k+1)_(i) := (1 − γ) α^(k)_(i) + γ s_(i)
        (Optionally: Update ᾱ^(k+1) := (k/(k+2)) ᾱ^(k) + (2/(k+2)) α^(k+1))   (maintain a weighted average of the iterates)

To compute the line-search step-size, we simply re-use the same formula as in Algorithm 4, but reconstructing (implicitly) on the fly the missing quantities such as ℓ_i = bᵀα_[i], w_i = Aα_[i] and w^(k) = Aα^(k), and re-interpreting dot products such as w_iᵀw^(k) as the suitable sum of kernel evaluations (which has O(k²/n) terms, where k is the number of iterations since the beginning).

C. Analysis of the Block-Coordinate Frank-Wolfe Algorithm 3

This section gives a self-contained presentation and analysis of the new block-coordinate Frank-Wolfe optimization Algorithm 3. The main goal is to prove the convergence Theorem 2, which here is split into two parts: the primal convergence rate in Theorem C.1, and the primal-dual convergence rate in Theorem C.3. Finally, we will present a faster convergence result for the line-search variant in Theorem C.4 and Theorem C.5, which we have used in the convergence analysis for the structural SVM case as presented above in Theorem 3.

Coordinate Descent Methods. Despite their simplicity and very early appearance in the literature, surprisingly few results were known on the convergence (and convergence rates in particular) of coordinate descent type methods. Recently, the interest in these methods has grown again due to their good scalability to very large scale problems, as e.g. in machine learning, and has also sparked new theoretical results such as (Nesterov, 2012).

Constrained Convex Optimization over Product Domains. We consider the general constrained convex optimization problem

    min_{x ∈ M} f(x)   (10)

over a Cartesian product domain M = M^(1) × ... × M^(n) ⊆ R^m, where each factor M^(i) ⊆ R^{m_i} is convex and compact, and Σ_{i=1}^n m_i = m. We will write x_(i) ∈ R^{m_i} for the i-th block of coordinates of a vector x ∈ R^m, and x_[i] for the padding of x_(i) with zeros so that x_[i] ∈ R^m.

Nesterov's 'Huge Scale' Coordinate Descent. If the objective function f is strongly smooth (i.e. has Lipschitz continuous partial gradients ∇_(i) f(x) ∈ R^{m_i}), then the following algorithm converges⁵ at a rate of 1/k, ...

⁵ By additionally assuming strong convexity of f w.r.t. the ℓ₁-norm (global on M, not only on the individual factors), one can even get linear convergence rates; see again (Nesterov, 2012) and the follow-up paper (Richtárik & Takáč, 2011).

In the multiplicative case, we choose a fixed multiplicative error parameter 0 < ν ≤ 1 such that the candidate directions s_(i) attain the current 'duality gap' on the i-th factor up to a multiplicative approximation error of ν, i.e.

    ⟨x_(i) − s_(i), ∇_(i) f(x)⟩ ≥ ν · max_{s'_(i) ∈ M^(i)} ⟨x_(i) − s'_(i), ∇_(i) f(x)⟩.   (12)

If a multiplicative approximate internal oracle is used together with the predefined step-size instead of doing line-search, then the step-size in Algorithm C.2 needs to be increased to γ_k := 2n/(νk + 2n) instead of the original 2n/(k + 2n). Both types of errors can be combined together with the following property for the candidate direction s_(i):

    ⟨x_(i) − s_(i), ∇_(i) f(x)⟩ ≥ ν · max_{s'_(i) ∈ M^(i)} ⟨x_(i) − s'_(i), ∇_(i) f(x)⟩ − (1/2) δ γ̃_k C_f^(i),   (13)

where γ̃_k := 2n/(k + 2n).

Averaging the Iterates. In the above Algorithm C.2, we have also added an optional last line which maintains the following weighted average x̄_w^(k), defined for k ≥ 1 as

    x̄_w^(k) := (2/(k(k+1))) Σ_{t=1}^k t · x^(t),   (14)

and by convention we also define x̄_w^(0) := x^(0). As our convergence analysis will show, the weighted average of the iterates can yield more robust duality gap convergence guarantees when the duality gap function g is convex in x (see Theorem C.3) — this is for example the case for quadratic functions such as in the structural SVM objective (4). We will also consider in our proofs a scheme which averages the last (1−ρ)-fraction of the iterates, for some fixed 0 ≤ ρ < 1:

    x̄_ρ^(k) := (1/(k − ⌈ρk⌉ + 1)) Σ_{t=⌈ρk⌉}^k x^(t).   (15)

This is what Rakhlin et al. (2012) call (1−ρ)-suffix averaging, and it appeared in the context of getting a stochastic subgradient method with O(1/k) convergence rate for strongly convex functions, instead of the standard O((log k)/k) rate that one can prove for the individual iterates x^(k). The problem with (1−ρ)-suffix averaging is that to implement it for a fixed ρ (say ρ = 0.5) without storing a fraction of all the iterates, one needs to know when the algorithm will be stopped. An alternative mentioned in Rakhlin et al. (2012) is to maintain a uniform average over rounds of exponentially increasing size (the so-called 'doubling trick'). This can give very good performance towards the end of the rounds, as we will see in our additional experiments in Appendix F, but the performance varies widely towards the beginning of the rounds. This motivates the simpler and more robust weighted averaging scheme (14), which, in the case of the stochastic subgradient method, was also recently proven to have O(1/k) convergence rate by Lacoste-Julien et al. (2012)⁶ and independently by Shamir & Zhang (2013), who called such schemes 'polynomial-decay averaging'.

Related Work. In contrast to the randomized choice of coordinate which we use here, the analysis of cyclic coordinate descent algorithms (going through the blocks sequentially) seems to be notoriously difficult, such that until today, no analysis proving a global convergence rate has been obtained as far as we know. Luo & Tseng (1992) have proven a local linear convergence rate for the strongly convex case. For product domains, such a cyclic analogue of our Algorithm C.2 has already been proposed in Patriksson (1998), using a generalization of Frank-Wolfe iterations under the name 'cost approximation'. The analysis of Patriksson (1998) shows asymptotic convergence, but since the method goes through the blocks sequentially, no convergence rates could be proven so far.

⁶ In this paper, they considered a (k+1)-weight instead of our k-weight, but similar rates can be proven for shifted versions. We motivate skipping the first iterate x^(0) in our weighted averaging scheme by the fact that sometimes bounds can be proven on the quality of x^(1) irrespective of x^(0) for Frank-Wolfe (see the paragraph after the proof of Theorem C.1 for example, looking at the n = 1 case).
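The weighted average (14) is cheap to maintain precisely because it satisfies the simple recursion x̄_w^(k+1) = (k/(k+2)) x̄_w^(k) + (2/(k+2)) x^(k+1), which is what the optional last line of the algorithm uses. A quick numerical sketch (with random stand-in vectors playing the role of the iterates — an illustrative assumption) confirms that the recursion matches the closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.standard_normal((50, 3))  # stand-ins for the iterates x^(1), ..., x^(50)

# recursive update used inside the algorithm (O(dim) work per step)
xbar = xs[0].copy()                # x_w^(1) := x^(1)
for k in range(1, len(xs)):        # fold in x^(k+1) for k = 1, 2, ...
    xbar = (k / (k + 2)) * xbar + (2 / (k + 2)) * xs[k]

# closed form (14): x_w^(k) = 2/(k(k+1)) * sum_{t=1}^{k} t * x^(t)
k = len(xs)
closed = 2.0 / (k * (k + 1)) * sum((t + 1) * xs[t] for t in range(k))
```

The two computations agree to machine precision, so the averaged iterate never needs to be recomputed from stored history.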
C.1. Setup for Convergence Analysis

We review below the important concepts needed for analyzing the convergence of the block-coordinate Frank-Wolfe Algorithm C.2.

Decomposition of the Duality Gap. The product structure of our domain has a crucial effect on the duality gap, namely that it decomposes into a sum over the n components of the domain. The 'linearization' duality gap as defined in (5) (see also Jaggi (2013)) for any constrained convex problem of the above form (10), for a fixed feasible point x ∈ M, is given by

    g(x) := max_{s ∈ M} ⟨x − s, ∇f(x)⟩ = Σ_{i=1}^n max_{s_(i) ∈ M^(i)} ⟨x_(i) − s_(i), ∇_(i) f(x)⟩ =: Σ_{i=1}^n g^(i)(x).   (16)

Curvature. Also, the curvature can now be defined on the individual factors:

    C_f^(i) := sup_{x ∈ M, s_(i) ∈ M^(i), γ ∈ [0,1], y = x + γ(s_[i] − x_[i])}  (2/γ²) (f(y) − f(x) − ⟨y_(i) − x_(i), ∇_(i) f(x)⟩).   (17)

We recall that the notation x_[i] and x_(i) is defined just below (10). We define the global product curvature as the sum of these curvatures for each block, i.e.

    C_f^⊗ := Σ_{i=1}^n C_f^(i).   (18)

C.2. Primal Convergence on Product Domains

The following main theorem shows that after O(1/ε) many iterations, Algorithm C.2 obtains an ε-approximate solution.

Theorem C.1 (Primal Convergence). For each k ≥ 0, the iterate x^(k) of the exact variant of Algorithm C.2 satisfies

    E[f(x^(k))] − f(x*) ≤ (2n/(k + 2n)) (C_f^⊗ + f(x^(0)) − f(x*)).

For the approximate variant of Algorithm C.2 with additive approximation quality (11) for δ ≥ 0, it holds that

    E[f(x^(k))] − f(x*) ≤ (2n/(k + 2n)) (C_f^⊗ (1 + δ) + f(x^(0)) − f(x*)).

For the approximate variant of Algorithm C.2 with multiplicative approximation quality (12) for 0 < ν ≤ 1, it holds that

    E[f(x^(k))] − f(x*) ≤ (2n/(νk + 2n)) ((1/ν) C_f^⊗ + f(x^(0)) − f(x*)).

All convergence bounds hold whether the predefined step-sizes or line-search is used in the algorithm. Here x* ∈ M is an optimal solution to problem (10), and the expectation is with respect to the random choice of blocks during the algorithm. (In other words, all three algorithm variants deliver a solution of (expected) primal error at most ε after O(1/ε) many iterations.)

The proof of the above theorem on the convergence rate of the primal error crucially depends on the following Lemma C.2 on the improvement in each iteration.
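The decomposition (16) can be checked numerically on a product of probability simplices, where the per-block linear minimization simply puts all mass on the smallest gradient coordinate (an illustrative sketch with random data, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
blocks = [3, 4, 2]                    # M = simplex_3 x simplex_4 x simplex_2
grad = [rng.standard_normal(m) for m in blocks]   # stand-in for grad_(i) f(x)
x = [np.ones(m) / m for m in blocks]  # a feasible point (uniform in each block)

# per-block gap g^(i)(x) = <x_(i), grad_i> - min_j grad_i[j]: the block LMO
# over a simplex puts all mass on the smallest gradient coordinate
block_gaps = [float(xi @ gi - gi.min()) for xi, gi in zip(x, grad)]

# global gap computed directly on the concatenated vectors
xcat, gcat = np.concatenate(x), np.concatenate(grad)
s = np.concatenate([np.eye(m)[gi.argmin()] for m, gi in zip(blocks, grad)])
g_total = float((xcat - s) @ gcat)
```

The global gap and the sum of block gaps coincide exactly, which is what makes the single-block updates of the algorithm meaningful: each update can target its own component g^(i)(x) of the overall certificate.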
Block-CoordinateFrank-WolfeOptimizationforStructuralSVMs =1.FromtheaboveLemmaC.2,weknowthatforeveryinnerstepofAlgorithmC.2andconditionedonx(k),wehavethatE[f(x(k+1) )jx(k)]f(x(k)) ng(x(k))+ 2 2nC f,wheretheexpectationisovertherandomchoiceoftheblocki(thisboundholdsindependentlywhetherline-searchisusedornot).Writingh(x):=f(x)f(x)forthe(unknown)primalerroratanypointx,thisreadsasE[h(x(k+1) )jx(k)]h(x(k)) ng(x(k))+ 2 2nC fh(x(k)) nh(x(k))+ 2 2nC f=(1 n)h(x(k))+ 2 2nC f;(20)whereinthesecondline,wehaveusedweakdualityh(x)g(x)(whichfollowsdirectlyfromthedenitionofthedualitygap,togetherwithconvexityoff).Theinequality(20)isconditionedonx(k),whichisarandomquantitygiventhepreviousrandomchoicesofblockstoupdate.Wegetadeterministicinequalitybytakingtheexpectationofbothsideswithrespecttotherandomchoiceofpreviousblocks,yielding:E[h(x(k+1) )](1 n)E[h(x(k))]+ 2 2nC f:(21)Weobservethattheresultinginequality(21)with=1isofthesameformastheoneappearinginthestandardFrank-WolfeprimalconvergenceproofsuchasinJaggi(2013),thoughwithacrucialdierenceofthe1=nfactor(andthatwearenowworkingwiththeexpectedvaluesE[h(x(k))]insteadoftheoriginalh(x(k))).Wewillthusfollowasimilarinductionargumentoverk,butwewillseethatthe1=nfactorwillyieldaslightlydierentinductionbasecase(whichforn=1canbeanalyzedseparatelytoobtainabetterbound).Tosimplifythenotation,lethk:=E[h(x(k))].Byinduction,wearenowgoingtoprovethathk2nC k+2nfork0:forthechoiceofconstantC:=1 C f+h0.Thebase-casek=0followsimmediatelyfromthedenitionofC,giventhatCh0.Nowweconsidertheinductionstepfork0.Herethebound(21)fortheparticularchoiceofstep-size k:=2n k+2n2[0;1]givenbyAlgorithmC.2givesus(thesameboundalsoholdsfortheline-searchvariant,giventhatthecorrespondingobjectivevaluef(x(k+1)Line-Search)f(x(k+1) )onlyimproves):hk+1(1 k n)hk+( k)2C 2n=(12 k+2n)hk+(2n k+2n)2C 2n(12 k+2n)2nC k+2n+(1 k+2n)22nC;whereintherstlinewehaveusedthatC fC,andinthelastinequalitywehavepluggedintheinductionhypothesisforhk.Simplyrearrangingthetermsgiveshk+12nC k+2n12 k+2n+ k+2n=2nC 
k+2nk+2n k+2n2nC k+2nk+2n k+2n+=2nC (k+1)+2n;whichisourclaimedboundfork0.TheanalogousclaimforAlgorithmC.2usingtheapproximatelinearprimitivewithadditiveapproximationquality(11)with~ k=2n k+2nfollowsfromexactlythesameargument,byreplacingeveryoccurrenceofC fintheproofherebyC f(1+)instead(comparetoLemmaC.2also{notethat =~ khere).Notemoreoverthatonecancombineeasilybothamultiplicativeapproximationwithanadditiveoneasin(13),andmodifytheconvergencestatementaccordingly. Block-CoordinateFrank-WolfeOptimizationforStructuralSVMs DomainsWithoutProductStructure:n=1.OuraboveconvergenceresultalsoholdsforthecaseofthestandardFrank-Wolfealgorithm,whennoproductstructureonthedomainisassumed,i.e.forthecasen=1.Inthiscase,theconstantintheconvergencecanevenbeimprovedforthevariantofthealgorithmwithoutamultiplicativeapproximation(=1),sincetheadditivetermgivenbyh0,i.e.theerroratthestartingpoint,canberemoved.Thisisbecausealreadyaftertherststep,weobtainaboundforh1whichisindependentofh0.Moreprecisely,plugging 0:=1and=1inthebound(21)whenn=1givesh10+C f(1+)C.Usingk=1asthebasecaseforthesameinductionproofasabove,weobtainthatforn=1:hk2 k+2C f(1+)forallk1;whichmatchestheconvergencerategiveninJaggi(2013).NotethatinthetraditionalFrank-Wolfesetting,i.e.n=1,ourdenedcurvatureconstantbecomesC f=Cf.Dependenceonh0.Wenotethattheonlyuseofincludingh0intheconstantC=1C f+h0wastosatisfythebasecaseintheinductionproof,atk=0.Iffromthestructureoftheproblemwecangetaguaranteethath01C f,thenthesmallerconstantC0=1C fwillsatisfythebasecaseandthewholeproofwillgothroughwithit,withoutneedingtheextrah0factor.SeealsoTheoremC.4forabetterconvergenceresultwithaweakerdependenceonh0inthecasewheretheline-searchisused.C.3.ObtainingSmallDualityGapThefollowingtheoremshowsthatafterO1 
"manyiterations,AlgorithmC.2willhavevisitedasolutionwith"-smalldualitygapinexpectation.Becausetheblock-coordinateFrank-Wolfealgorithmisonlylookingatoneblockatatime,itdoesn'tknowwhatisitscurrenttruedualitygapwithoutdoingafull(batch)passoverallblocks.Withoutmonitoringthisquantity,thealgorithmcouldmisswhichiteratehadalowdualitygap.Thisiswhy,ifoneisinterestedinhavingagooddualitygap(suchasinthestructuralSVMapplication),thentheaveragingschemesconsideredin(14)and(15)becomeinteresting:thefollowingtheoremalsosaysthattheboundholdforeachoftheaveragediterates,ifthedualitygapfunctiongisconvex,whichisthecaseforexamplewhenfisaquadraticfunction.7TheoremC.3(Primal-DualConvergence).ForeachK0,thevariantsofAlgorithmC.2(eitherusingthepredenedstep-sizes,orusingline-search)willyieldatleastoneiteratex(^k)with^kKwithexpecteddualitygapboundedbyEg(x(^k))2n (K+1)C;where=3andC=1C f(1+)+f(x(0))f(x).0and01aretheapproximationqualityparametersasdenedin(13){use=0and=1fortheexactvariant.Moreover,ifthedualitygapgisaconvexfunctionofx,thentheaboveboundalsoholdsbothforEg(x(K)w)andEg(x(K)0:5)foreachK0,wherex(K)wistheweightedaverageoftheiteratesasdenedin(14)andx(K)0:5isthe0:5-suxaverageoftheiteratesasdenedin(15)with=0:5.Proof.Tosimplifynotation,wewillagaindenotetheexpectedprimalerrorandexpecteddualitygapforanyiterationk0inthealgorithmbyhk:=E[h(x(k))]:=E[f(x(k))f(x)]andgk:=E[g(x(k))]respectively.TheproofstartsagainbyusingthecrucialimprovementLemmaC.2with = k:=2n k+2ntocoverbothvariantsofAlgorithmC.2atthesametime.AsinthebeginningoftheproofofTheoremC.1,wetaketheexpectationwithrespecttox(k)inLemmaC.2andsubtractf(x)togetthatforeachk0(forthegeneralapproximatevariantofthealgorithm):hk+1hk1 n kgk+1 2n( k2+~ k k)C f=hk1 n kgk+1 2n k2C f(1+); 
7. To see that $g$ is convex when $f$ is quadratic, we refer to the equivalence between the gap $g(x)$ and the Fenchel duality gap $p(x) - d(\nabla f(x))$ as shown in Appendix D. The dual function $d(\cdot)$ is concave, so if $\nabla f(x)$ is an affine function of $x$ (which is the case for a quadratic function), then $d(\nabla f(x))$ will be a concave function of $x$, implying that $g(x) = p(x) - d(\nabla f(x))$ is convex in $x$, since the primal function $p$ is convex.

Block-Coordinate Frank-Wolfe Optimization for Structural SVMs

since $\tilde{\gamma}_k \ge \gamma_k$. By isolating $g_k$ and using the fact that $C := \frac{1}{\nu} C_f^\otimes (1+\delta)$, we get the crucial inequality for the expected duality gap:
$$g_k \le \frac{n}{\gamma_k}\,(h_k - h_{k+1}) + \frac{\gamma_k C}{2}. \quad (22)$$
The general proof idea to get a handle on $g_k$ is to take a convex combination of the inequality (22) over multiple $k$'s, to obtain a new upper bound. Because a convex combination of numbers is upper bounded by its maximum, we know that the new bound has to upper bound at least one of the $g_k$'s (this gives the existence-of-$\hat{k}$ part of the theorem). Moreover, if $g$ is convex, we can also obtain an upper bound for the expected duality gap of the same convex combination of the iterates. So let $\{w_k\}_{k=0}^K$ be a set of non-negative weights, and let $\rho_k := w_k / S_K$, where $S_K := \sum_{k=0}^K w_k$. Taking the convex combination of inequality (22) with coefficients $\rho_k$, we get
$$\sum_{k=0}^K \rho_k g_k \le n \sum_{k=0}^K \rho_k \frac{h_k - h_{k+1}}{\gamma_k} + \sum_{k=0}^K \frac{\rho_k \gamma_k C}{2}$$
$$= n\left[\frac{h_0 \rho_0}{\gamma_0} - \frac{h_{K+1}\rho_K}{\gamma_K} + \sum_{k=0}^{K-1} h_{k+1}\left(\frac{\rho_{k+1}}{\gamma_{k+1}} - \frac{\rho_k}{\gamma_k}\right)\right] + \sum_{k=0}^K \frac{\rho_k \gamma_k C}{2}$$
$$\le n\left[\frac{h_0 \rho_0}{\gamma_0} + \sum_{k=0}^{K-1} h_{k+1}\left(\frac{\rho_{k+1}}{\gamma_{k+1}} - \frac{\rho_k}{\gamma_k}\right)\right] + \sum_{k=0}^K \frac{\rho_k \gamma_k C}{2}, \quad (23)$$
using $h_{K+1} \ge 0$. Inequality (23) can be seen as a master inequality from which to derive various bounds on $g_k$. In particular, if we define $\bar{x} := \sum_{k=0}^K \rho_k x^{(k)}$ and we suppose that $g$ is convex (which is the case for example when $f$ is a quadratic function; see footnote 7), then we have $\mathbb{E}[g(\bar{x})] \le \sum_{k=0}^K \rho_k g_k$ by convexity and linearity of the expectation.

Weighted-averaging case. We first consider the weights $w_k = k$ which appear in the definition of the weighted average of the iterates $\bar{x}^{(K)}_w$ in (14), and suppose $K \ge 1$. In this case, we have $\rho_k = k/S_K$ where $S_K = K(K+1)/2$. With the predefined step-size $\gamma_k = 2n/(k+2n)$, we then have
$$\frac{\rho_{k+1}}{\gamma_{k+1}} - \frac{\rho_k}{\gamma_k} = \frac{1}{2n S_K}\big((k+1)((k+1)+2n) - k(k+2n)\big) = \frac{(2k+1)+2n}{2n S_K}.$$
Plugging this into the master inequality (23), as well as using the convergence rate $h_k \le \frac{2nC}{k+2n}$ from Theorem C.1, we obtain
$$\sum_{k=0}^K \rho_k g_k \le \frac{n}{S_K}\left[0 + \sum_{k=0}^{K-1} \frac{2nC}{(k+1)+2n}\cdot\frac{(2k+1)+2n}{2n}\right] + \sum_{k=0}^K \frac{2nk}{k+2n}\cdot\frac{C}{2 S_K}$$
$$\le \frac{nC}{S_K}\left[2\sum_{k=0}^{K-1} 1 + \sum_{k=1}^{K} 1\right] = \frac{2nC}{K+1}\cdot 3.$$
Hence we have proven the bound with $\mu = 3$ for $K \ge 1$. For $K = 0$, the master inequality (23) becomes $g_0 \le n\,h_0 + \tfrac{1}{2}C \le nC\big(1 + \tfrac{1}{2n}\big)$, since $h_0 \le C$ and $\gamma_0 = 1$. Given that $n \ge 1$, we see that the bound also holds for $K = 0$.

Suffix-averaging case. For the proof of convergence of the $0.5$-suffix averaging of the iterates $\bar{x}^{(K)}_{0.5}$, we refer the reader to the proof of Theorem C.5, which can be re-used for this case (see the last paragraph of that proof for an explanation of how).

Domains Without Product Structure: $n = 1$. As we mentioned after the proof of the primal convergence Theorem C.1, we note that if $n = 1$, then we can replace $C$ in the statement of Theorem C.3 by $C_f(1+\delta)$ for $K \ge 1$ when $\nu = 1$, as then we can ensure that $h_1 \le C$, which is all that was needed for the primal convergence induction. Again, $C_f^\otimes = C_f$ when $n = 1$.

C.4. An Improved Convergence Analysis for the Line-Search Case

C.4.1. Improved Primal Convergence for Line-Search

If line-search is used, we can improve the convergence results of Theorem C.1 by showing a weaker dependence on the starting condition $h_0$, thanks to faster progress in the starting phase of the first few iterations:

Theorem C.4 (Improved Primal Convergence for Line-Search). For each $k \ge k_0$, the iterate $x^{(k)}$ of the line-search variant of Algorithm C.2 (where the linear subproblem is solved with a multiplicative approximation quality (12) of $0 < \nu \le 1$) satisfies
$$\mathbb{E}\big[f(x^{(k)})\big] - f(x^*) \;\le\; \frac{1}{\nu}\,\frac{2n\,C_f^\otimes}{(k-k_0)+2n}, \quad (24)$$
where $k_0 := \max\big\{0,\ \big\lceil \log\big(2\nu\, h(x^{(0)}) / C_f^\otimes\big) \big/ \log(1/\tilde{\nu}_n) \big\rceil\big\}$ is the number of steps required to guarantee that $\mathbb{E}\big[f(x^{(k_0)})\big] - f(x^*) \le \frac{1}{\nu} C_f^\otimes$, with $x^* \in \mathcal{M}$ being an optimal solution to problem (10), $h(x^{(0)}) := f(x^{(0)}) - f(x^*)$ being the primal error at the starting point, and $\tilde{\nu}_n := 1 - \frac{\nu}{n} \le 1$ being the geometric decrease rate of the primal error in the first phase while $k \le k_0$ — i.e. $\mathbb{E}\big[f(x^{(k)})\big] - f(x^*) \le (\tilde{\nu}_n)^k\, h(x^{(0)}) + C_f^\otimes/(2\nu)$ for $k \le k_0$. If the linear subproblem is solved with an additive approximation quality (11) of $\delta \ge 0$ instead, then replace all appearances of $C_f^\otimes$ above with $C_f^\otimes(1+\delta)$.

Proof. For the line-search case, the expected improvement guaranteed by Lemma C.2 for the multiplicative approximation variant of Algorithm C.2, in expectation as in (21), is valid for any choice of $\gamma \in [0,1]$:
$$\mathbb{E}\,h\big(x^{(k+1)}_{LS}\big) \le \Big(1 - \frac{\gamma\nu}{n}\Big)\,\mathbb{E}\,h\big(x^{(k)}\big) + \frac{\gamma^2}{2n}\, C_f^\otimes. \quad (25)$$
Because the bound (25) holds for any $\gamma$, we are free to choose the one which minimizes it subject to $\gamma \in [0,1]$, that is $\gamma^* := \min\big\{1,\ \nu h_k / C_f^\otimes\big\}$, where we have again used the identification $h_k := \mathbb{E}\,h\big(x^{(k)}_{LS}\big)$. Now we distinguish two cases.

If $\gamma^* = 1$, then $h_k \ge C_f^\otimes/\nu$. By unrolling the inequality (25) recursively to the beginning and using $\gamma = 1$ at each step, we get:
$$h_{k+1} \le \Big(1-\frac{\nu}{n}\Big) h_k + \frac{1}{2n}C_f^\otimes \le \Big(1-\frac{\nu}{n}\Big)^{k+1} h_0 + \frac{1}{2n}C_f^\otimes \sum_{t=0}^{k}\Big(1-\frac{\nu}{n}\Big)^t$$
$$\le \Big(1-\frac{\nu}{n}\Big)^{k+1} h_0 + \frac{1}{2n}C_f^\otimes \sum_{t=0}^{\infty}\Big(1-\frac{\nu}{n}\Big)^t = \Big(1-\frac{\nu}{n}\Big)^{k+1} h_0 + \frac{1}{2n}C_f^\otimes\,\frac{1}{1-(1-\nu/n)} = \Big(1-\frac{\nu}{n}\Big)^{k+1} h_0 + \frac{1}{2\nu}C_f^\otimes.$$
We thus have a geometric decrease with rate $\tilde{\nu}_n := 1 - \frac{\nu}{n}$ in this phase. We then get $h_k \le \frac{1}{\nu}C_f^\otimes$ as soon as $(\tilde{\nu}_n)^k h_0 \le C_f^\otimes/(2\nu)$, i.e. when $k \ge \log_{1/\tilde{\nu}_n}\big(2\nu h_0/C_f^\otimes\big) = \log\big(2\nu h_0/C_f^\otimes\big)/\log(1/\tilde{\nu}_n)$. We have thus obtained a logarithmic bound on the number of steps that fall into the first regime here, i.e. where $h_k$ is still `large'. Here it is crucial to note that the primal error $h_k$ is always decreasing in each step, due to the line-search, so once we leave this regime of $h_k \ge \frac{1}{\nu}C_f^\otimes$, we will never enter it again in subsequent steps. On the other hand, as soon as we reach a step $k$ (e.g. when $k = k_0$) such that $\gamma^* \le 1$, or equivalently $h_k \le \frac{1}{\nu}C_f^\otimes$, then we are always in the second phase, where $\gamma^* = \nu h_k/C_f^\otimes$. Plugging this value of $\gamma^*$ into (25) yields the recurrence bound:
$$h_{k+1} \le h_k - \frac{\nu^2}{2n\,C_f^\otimes}\, h_k^2 \qquad \forall k \ge k_0. \quad (26)$$

Theorem C.5 (Improved Primal-Dual Convergence for Line-Search). Let $k_0$ be defined as in Theorem C.4. For each $K \ge 5k_0$, the line-search variant of Algorithm C.2 will yield at least one iterate $x^{(\hat{k})}$ with $\hat{k} \le K$ with expected duality gap bounded by
$$\mathbb{E}\,g\big(x^{(\hat{k})}\big) \le \frac{2n\mu}{K+2}\,C, \qquad \text{where } \mu = 3 \text{ and } C = \tfrac{1}{\nu}\,C_f^\otimes(1+\delta).$$
Here $\delta \ge 0$ and $0 < \nu \le 1$ are the approximation parameters as defined in (13); use $\delta = 0$ and $\nu = 1$ for the exact variant. Moreover, if the duality gap $g$ is a convex function of $x$, then the above bound also holds for $\mathbb{E}\,g\big(\bar{x}^{(K)}_{0.5}\big)$ for each $K \ge 5k_0$, where $\bar{x}^{(K)}_{0.5}$ is the $0.5$-suffix average of the iterates as defined in (15).

Proof. We follow a similar argument as in the proof of Theorem C.3, but making use of the better primal convergence of Theorem C.4, as well as using the $0.5$-suffix average in the master inequality (23). Let $K \ge 5k_0$ be given. Let $\gamma_k := \frac{2n}{\nu(k-k_0)+2n}$ for $k \ge k_0$. Note then that $\tilde{\gamma}_k := \frac{2n}{k+2n} \le \gamma_k \le 1$, and so the gap inequality (22) appearing in the proof of Theorem C.3 is valid for this $\gamma_k$ (because we are considering the line-search variant of Algorithm C.2, we are free to choose any $\gamma \in [0,1]$ in Lemma C.2). This means that the master inequality (23) is also valid here with $C = \frac{1}{\nu}\,C_f^\otimes(1+\delta)$.

We consider the weights which appear in the definition of the $0.5$-suffix average of the iterates $\bar{x}^{(K)}_{0.5}$ given in (15), i.e. the average of the iterates $x^{(k)}$ from $k = K_s := \lceil 0.5K \rceil$ to $k = K$. We thus have $\rho_k = 1/S_K$ for $K_s \le k \le K$ and $\rho_k = 0$ otherwise, where $S_K = K - \lceil 0.5K \rceil + 1$. Notice that $K_s \ge k_0$ by assumption. With these choices of $\rho_k$ and $\gamma_k$, the master inequality (23) becomes
$$\sum_{k=0}^{K} \rho_k g_k \le \frac{n}{S_K}\Bigg[\frac{h_{K_s}}{\gamma_{K_s}} + \sum_{k=K_s}^{K-1} h_{k+1}\Big(\frac{1}{\gamma_{k+1}} - \frac{1}{\gamma_k}\Big)\Bigg] + \sum_{k=K_s}^{K} \frac{\gamma_k C}{2 S_K}$$
$$\le \frac{n}{S_K}\Bigg[C + \sum_{k=K_s}^{K-1} \frac{2nC}{\nu(k+1-k_0)+2n}\cdot\frac{\nu}{2n}\Bigg] + \sum_{k=K_s}^{K} \frac{2n}{\nu(k-k_0)+2n}\cdot\frac{C}{2 S_K}$$
$$\le \frac{nC}{S_K}\Bigg[1 + \sum_{k=K_s}^{K-1} \frac{1}{(k+1-k_0)+2n/\nu} + \sum_{k=K_s}^{K} \frac{1}{(k-k_0)+2n/\nu}\Bigg]$$
$$\le \frac{nC}{S_K}\Bigg[1 + 2\sum_{k=K_s}^{K} \frac{1}{(k-k_0)+2n/\nu}\Bigg] \le \frac{2nC}{K+2}\Bigg[1 + 2\sum_{k=K_s}^{K} \frac{1}{(k-k_0)+2n/\nu}\Bigg], \quad (28)$$
where in the second line we used the faster convergence rate $h_k \le \frac{2nC}{\nu(k-k_0)+2n}$ from Theorem C.4, given that $K_s \ge k_0$. In the last line, we used $S_K \ge 0.5K + 1 \ge (K+2)/2$. The rest of the proof simply amounts to getting an upper bound of $\mu = 3$ on the term between brackets in (28), thus concluding that $\sum_{k=0}^K \rho_k g_k \le \frac{2n\mu C}{K+2}$. Then, following a similar argument as in Theorem C.3, this will imply that there exists some $g_{\hat{k}}$ similarly upper bounded (the existence part of the theorem); and that if $g$ is convex, we have that $\mathbb{E}\,g\big(\bar{x}^{(K)}_{0.5}\big)$ is also similarly upper bounded.

We can upper bound the summand term in (28) by using the fact that for any non-negative decreasing integrable function $f$, we have $\sum_{k=K_s}^{K} f(k) \le \int_{K_s-1}^{K} f(t)\,dt$. Let $a_\nu := k_0 - 2n/\nu$. Using $f(k) := \frac{1}{k - a_\nu}$, we have that
$$\sum_{k=K_s}^{K} \frac{1}{k - a_\nu} \le \int_{K_s-1}^{K} \frac{dt}{t - a_\nu} = \big[\log(t - a_\nu)\big]_{t=K_s-1}^{t=K} = \log\frac{K - a_\nu}{K_s - 1 - a_\nu} \le \log\frac{K - a_\nu}{0.5K - 1 - a_\nu} =: b(K).$$

Plugging this condition and the expression (31) for $w$ back into the Lagrangian, we obtain the Lagrange dual problem
$$\max_{\alpha}\; -\frac{\lambda}{2}\,\Bigg\|\sum_{i\in[n],\,y\in\mathcal{Y}_i} \alpha_i(y)\,\frac{\psi_i(y)}{\lambda n}\Bigg\|^2 + \sum_{i\in[n],\,y\in\mathcal{Y}_i} \alpha_i(y)\,\frac{L_i(y)}{n}$$
$$\text{s.t.}\quad \sum_{y\in\mathcal{Y}_i} \alpha_i(y) = 1 \quad \forall i\in[n], \qquad \alpha_i(y) \ge 0 \quad \forall i\in[n],\ \forall y\in\mathcal{Y}_i,$$
which is exactly the negative of the quadratic program claimed in (4).
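As a quick numerical sanity check of the sum-to-integral bound used above, the following self-contained Python sketch (the function name and test parameters are our own, purely illustrative choices) compares $\sum_{k=K_s}^K 1/(k-a)$ with its integral upper bound $\log\frac{K-a}{K_s-1-a}$ for the decreasing function $f(t) = 1/(t-a)$:

```python
import math

def sum_vs_integral_bound(K_s, K, a):
    """For the positive decreasing f(t) = 1/(t - a) with a < K_s - 1,
    compare sum_{k=K_s}^K f(k) with its integral upper bound
    int_{K_s - 1}^K f(t) dt = log((K - a) / (K_s - 1 - a))."""
    s = sum(1.0 / (k - a) for k in range(K_s, K + 1))
    bound = math.log((K - a) / (K_s - 1 - a))  # antiderivative is log(t - a)
    return s, bound

# The sum never exceeds the integral bound, since f(k) <= int_{k-1}^{k} f.
s, bound = sum_vs_integral_bound(K_s=50, K=100, a=10.0)
```

The inequality holds because, for decreasing $f$, each term $f(k)$ is at most the integral of $f$ over $[k-1, k]$.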
F. Additional Experiments

Complementing the results presented in Figure 1 in Section 6 of the main paper, here we provide additional experimental results, as well as give more information about the experimental setup used.

For the Frank-Wolfe methods, Figure 2 presents results on OCR comparing setting the step-size by line-search against the simpler predefined step-size scheme of $\gamma_k = 2n/(k+2n)$. There, BCFW with predefined step-sizes does similarly to SSG, indicating that most of the improvement of BCFW with line-search over SSG comes from the optimal step-size choice (and not from the Frank-Wolfe formulation on the dual). We also see that BCFW with predefined step-sizes can even do worse than batch Frank-Wolfe with line-search in the early iterations for small values of $\lambda$.

Figure 3 and Figure 4 show additional results of the stochastic solvers for several values of $\lambda$ on the OCR and CoNLL datasets. Here we also include the (uniformly) averaged stochastic subgradient method (SSG-avg), which starts averaging at the beginning, as well as the $0.5$-suffix averaging versions of both SSG and BCFW (SSG-tavg and BCFW-tavg respectively), implemented using the `doubling trick' as described just after Equation (15) in Appendix C. The `doubling trick' uniformly averages all iterates since the last iteration which was a power of 2; it was described by Rakhlin et al. (2012), with experiments for SSG in Lacoste-Julien et al. (2012). In our experiments, BCFW-tavg sometimes slightly outperforms the weighted average scheme BCFW-wavg, but its performance fluctuates more widely, which is why we recommend BCFW-wavg, as mentioned in the main text. In our experiments, the objective value of SSG-avg is always worse than that of the other stochastic methods (apart from online-EG), which is why it was excluded from the main text. Online-EG performed substantially worse than the other stochastic solvers for the OCR dataset, and is therefore not included in the comparison for the other datasets (see footnote 9).

Finally, Figure 5 presents additional results for the matching application from Taskar et al. (2006).
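The `doubling trick' averaging described above can be sketched in a few lines. This is our own illustrative sketch (class name and scalar iterates are our choices, not the paper's implementation): a uniform running average that restarts whenever the iteration counter hits a power of 2, so that the reported point is always an average over a recent suffix of the iterates.

```python
class DoublingTrickAverage:
    """Uniform running average that restarts at every power-of-2
    iteration (the 'doubling trick'), approximating suffix averaging."""

    def __init__(self):
        self.k = 0        # iteration counter
        self.avg = None   # current suffix average
        self.count = 0    # number of iterates in the current suffix

    def update(self, x):
        self.k += 1
        if self.k & (self.k - 1) == 0:   # k is a power of 2: restart
            self.avg, self.count = x, 1
        else:                            # incremental uniform average
            self.count += 1
            self.avg += (x - self.avg) / self.count
        return self.avg
```

For example, feeding scalar iterates $1, \dots, 7$ returns the uniform average of iterates 4 through 7 (the suffix started at $k = 4$), i.e. $5.5$; for vector iterates one would hold a NumPy array instead of a scalar.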
9. The worse performance of the online exponentiated gradient method could be explained by the fact that it uses a log-parameterization of the dual variables, and so its iterates are forced to be in the interior of the probability simplex, whereas we know that the optimal solution for the structural SVM objective lies at the boundary of the domain, and thus these parameters need to go to infinity.
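To illustrate the footnote's point, here is a small sketch (our own, not from the paper): a log-parameterized (softmax) point always has strictly positive coordinates, so it lies in the interior of the probability simplex and can only approach a vertex as the gap between the parameters diverges.

```python
import math

def softmax(theta):
    """Map log-parameters theta to a point of the probability simplex."""
    m = max(theta)                          # shift for numerical stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

# Every coordinate is strictly positive for any finite theta: the point
# is interior. A vertex like (1, 0) is only reached in the limit where
# the difference between the parameters goes to infinity.
p = softmax([5.0, -5.0])
```

Widening the parameter gap (e.g. `softmax([50.0, -50.0])`) pushes the point closer to the vertex without ever reaching it, which matches the explanation above for why online-EG struggles when the optimum sits on the boundary.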