Block-Coordinate Frank-Wolfe Optimization for Structural SVMs



Figure 1. (a) OCR dataset, λ = 0.01. (b) OCR dataset, λ = 0.001. (c) OCR dataset, λ = 1/n. (d) CoNLL dataset, λ = 1/n. (e) Test error for λ = 1/n on CoNLL. (f) Matching dataset, λ = 0.001. The shaded areas for the stochastic methods (BCFW, SSG and online-EG) indicate the worst and best objective achieved in 10 randomized runs. The top row compares the suboptimality achieved by different solvers for different regularization parameters λ. For large λ (a), the stochastic algorithms (BCFW and SSG) perform considerably better than the batch solvers (cutting plane and FW). For small λ (c), even the batch solvers achieve a lower objective earlier on than SSG. Our proposed BCFW algorithm achieves a low objective in both settings. (d) shows the convergence for CoNLL with the first passes in more detail. Here BCFW already yields a low objective even after seeing only a few datapoints. The advantage is less clear for the test error in (e) though, where SSG-wavg does surprisingly well. Finally, (f) compares the methods for the matching prediction task.

The main results are discussed in the caption, while additional experiments can be found in Appendix F. In most of the experiments, the BCFW-wavg method dominates all competitors. Its superiority is especially striking for the first few iterations, and when using a small regularization strength λ, which is often needed in practice. In terms of test error, a peculiar observation is that the weighted average of the iterates seems to help both methods significantly: SSG-wavg sometimes slightly outperforms BCFW-wavg despite having the worst objective value amongst all methods. This phenomenon is worth further investigation.

7. Related Work

There has been substantial work on dual coordinate descent for SVMs, including the original sequential minimal optimization (SMO) algorithm. The SMO algorithm was generalized to structural SVMs (Taskar, 2004, Chapter 6), but its convergence rate scales badly with the size of the output space: it was estimated as O(n|Y|/λε) in Zhang et al. (2011). Further, this method requires an expectation oracle to work with its factored dual parameterization. As in our algorithm, Rousu et al. (2006) propose updating one training example at a time, but using multiple Frank-Wolfe updates to optimize along the subspace. However, they do not obtain any rate guarantees, and their algorithm is less general because it again requires an expectation oracle. In the degenerate binary SVM case, our block-coordinate Frank-Wolfe algorithm is actually equivalent to the method of Hsieh et al. (2008): because each datapoint has a unique dual variable, exact coordinate optimization can be accomplished by the line-search step of our algorithm. Hsieh et al. (2008) show a local linear convergence rate in the dual, and our results complement theirs by providing a global primal convergence guarantee for their algorithm of O(1/ε). After our paper had appeared on arXiv, Shalev-Shwartz & Zhang (2012) proposed a generalization of dual coordinate descent applicable to several regularized losses, including the structural SVM objective. Despite being motivated from a different perspective, a version of their algorithm (Option II of Figure 1) gives the exact same step-size and update direction as BCFW with line-search, and their Corollary 3 gives a convergence rate similar to our Theorem 3.

Table 1. Convergence rates given in the number of calls to the oracles for different optimization algorithms for the structural SVM objective (1) in the case of a Markov random field structure, to reach a specific accuracy ε measured for different types of gaps, in terms of the number of training examples n, the regularization parameter λ, the size of the label space |Y|, and the maximum feature norm R := max_{i,y} ‖ψ_i(y)‖_2 (some minor terms were ignored for succinctness). Table inspired from (Zhang et al., 2011). Notice that only stochastic subgradient and our proposed algorithm have rates independent of n.
Optimization algorithm                                | Online | Primal/Dual   | Type of guarantee    | Oracle type        | # Oracle calls
dual extragradient (Taskar et al., 2006)              | no     | primal-'dual' | saddle point gap     | Bregman projection | O(nR log|Y| / ε)
online exponentiated gradient (Collins et al., 2008)  | yes    | dual          | expected dual error  | expectation        | O((n + log|Y|) R² / (λε))
excessive gap reduction (Zhang et al., 2011)          | no     | primal-dual   | duality gap          | expectation        | O(nR √(log|Y| / (λε)))
BMRM (Teo et al., 2010)                               | no     | primal        | primal error         | maximization       | O(nR² / (λε))
1-slack SVM-Struct (Joachims et al., 2009)            | no     | primal-dual   | duality gap          | maximization       | O(nR² / (λε))
stochastic subgradient (Shalev-Shwartz et al., 2010a) | yes    | primal        | primal error w.h.p.  | maximization       | Õ(R² / (λε))
this paper: block-coordinate Frank-Wolfe              | yes    | primal-dual   | expected duality gap | maximization       | O(R² / (λε))  [Thm. 3]

Balamurugan et al. (2011) propose to approximately solve a quadratic problem on each example using SMO, but they do not provide any rate guarantees. The online-EG method implements a variant of dual coordinate descent, but it requires an expectation oracle, and Collins et al. (2008) estimate its primal convergence at only O(1/ε²).

Besides coordinate descent methods, a variety of other algorithms have been proposed for structural SVMs. We summarize a few of the most popular in Table 1, with their convergence rates quoted in the number of oracle calls needed to reach an accuracy of ε. However, we note that almost no guarantees are given for the optimization of structural SVMs with approximate oracles. A regret analysis in the context of online optimization was considered by Ratliff et al. (2007), but they do not analyze the effect of this on solving the optimization problem. The cutting plane algorithm of Tsochantaridis et al. (2005) was considered with approximate maximization by Finley & Joachims (2008), though the dependence of the running time on the approximation error was left unclear. In contrast, we provide guarantees for batch subgradient, cutting plane, and block-coordinate Frank-Wolfe for achieving an ε-approximate solution, as long as the error of the oracle is appropriately bounded.

8. Discussion

This work proposes a novel randomized block-coordinate generalization of the classic Frank-Wolfe algorithm for optimization with block-separable constraints. Despite its potentially much lower iteration cost, the new algorithm achieves a similar convergence rate in the duality gap as the full Frank-Wolfe method. For the dual structural SVM optimization problem, it leads to a simple online algorithm that resolves an issue that is notoriously difficult to address for stochastic algorithms: no step-size sequence needs to be tuned, since the optimal step-size can be efficiently computed in closed form. Further, at the cost of an additional pass through the data (which could be done alongside a full Frank-Wolfe iteration), it allows us to compute a duality gap guarantee that can be used to decide when to terminate the algorithm. Our experiments indicate that it empirically converges faster than other stochastic algorithms for the structural SVM problem, especially in the realistic setting where only a few passes through the data are possible. Although our structural SVM experiments use an exact maximization oracle, the duality gap guarantees, the optimal step-size, and a computable bound on the duality gap are all still available when only an appropriate approximate maximization oracle is used. Finally, although the structural SVM problem is what motivated this work, we expect that the block-coordinate Frank-Wolfe algorithm may be useful for other problems in machine learning where a complex objective with block-separable constraints arises.
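To make the closed-form step-size concrete, here is a minimal runnable sketch of the block-coordinate update with line-search (the core of Algorithm 4), instantiated on a toy multiclass SVM viewed as a structural SVM with 0-1 task loss; the synthetic data, the joint feature map psi, and all sizes are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

# Sketch of BCFW with the closed-form line-search (cf. Algorithm 4) on a toy
# multiclass SVM. Everything below (data, feature map, sizes) is illustrative.
rng = np.random.default_rng(0)
n, n_classes, d = 50, 3, 4
X = rng.standard_normal((n, d))
Y = rng.integers(n_classes, size=n)
lam = 0.01
dim = n_classes * d

def psi(x, y):
    # Joint feature map: copy x into the block of coordinates for class y.
    out = np.zeros(dim)
    out[y * d:(y + 1) * d] = x
    return out

w = np.zeros(dim)                 # w = sum_i w_i
w_i = np.zeros((n, dim))          # per-example primal contributions
ell, ell_i = 0.0, np.zeros(n)     # loss terms used in the step-size formula

for k in range(20 * n):
    i = rng.integers(n)
    # Max oracle (loss-augmented decoding), here by enumerating the labels.
    scores = [(y != Y[i]) + w @ (psi(X[i], y) - psi(X[i], Y[i]))
              for y in range(n_classes)]
    y_star = int(np.argmax(scores))
    w_s = (psi(X[i], Y[i]) - psi(X[i], y_star)) / (lam * n)
    ell_s = float(y_star != Y[i]) / n
    # Closed-form optimal step-size, clipped to [0, 1]: no tuning needed.
    diff = w_i[i] - w_s
    denom = lam * (diff @ diff)
    gamma = float(np.clip((lam * (diff @ w) - ell_i[i] + ell_s) / denom, 0, 1)) if denom > 0 else 0.0
    # Block update, then patch the global sums in O(dim).
    w_new = (1 - gamma) * w_i[i] + gamma * w_s
    ell_new = (1 - gamma) * ell_i[i] + gamma * ell_s
    w += w_new - w_i[i]
    ell += ell_new - ell_i[i]
    w_i[i], ell_i[i] = w_new, ell_new

train_err = np.mean([int(np.argmax([w @ psi(x, y) for y in range(n_classes)])) != y
                     for x, y in zip(X, Y)])
print("train error:", train_err)
```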
Acknowledgements. We thank Francis Bach, Bernd Gärtner and Ronny Luss for helpful discussions, and Robert Carnecky for the 3D illustration. MJ is supported by the ERC Project SIPA and by the Swiss National Science Foundation. SLJ and MS are partly supported by the ERC (SIERRA-ERC-239993). SLJ is supported by a Research in Paris fellowship. MS is supported by a NSERC postdoctoral fellowship.

References

Bach, F., Lacoste-Julien, S., and Obozinski, G. On the equivalence between herding and conditional gradient algorithms. In ICML, 2012.
Balamurugan, P., Shevade, S., Sundararajan, S., and Keerthi, S. A sequential dual method for structural SVMs. In SDM, 2011.
Caetano, T. S., McAuley, J. J., Cheng, Li, Le, Q. V., and Smola, A. J. Learning graph matching. IEEE PAMI, 31(6):1048-1058, 2009.
Clarkson, K. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 6(4):1-30, 2010.
Collins, M., Globerson, A., Koo, T., Carreras, X., and Bartlett, P. L. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. JMLR, 9:1775-1822, 2008.
Dunn, J. C. and Harshbarger, S. Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62(2):432-444, 1978.
Finley, T. and Joachims, T. Training structural SVMs when exact inference is intractable. In ICML, 2008.
Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95-110, 1956.
Gärtner, B. and Jaggi, M. Coresets for polytope distance. ACM Symposium on Computational Geometry, 2009.
Hsieh, C., Chang, K., Lin, C., Keerthi, S., and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In ICML, pp. 408-415, 2008.
Jaggi, M. Sparse convex optimization methods for machine learning. PhD thesis, ETH Zürich, 2011.
Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.
Joachims, T., Finley, T., and Yu, C. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27-59, 2009.
Lacoste-Julien, S., Schmidt, M., and Bach, F. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. Technical Report 1212.2002v2 [cs.LG], arXiv, December 2012.
Mangasarian, O. L. Machine learning via polyhedral concave minimization. Technical Report 95-20, University of Wisconsin, 1995.
Nesterov, Yurii. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
Ouyang, H. and Gray, A. Fast stochastic Frank-Wolfe algorithms for nonlinear SVMs. SDM, 2010.
Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.
Ratliff, N., Bagnell, J. A., and Zinkevich, M. (Online) subgradient methods for structured prediction. In AISTATS, 2007.
Rousu, J., Saunders, C., Szedmak, S., and Shawe-Taylor, J. Kernel-based learning of hierarchical multilabel classification models. JMLR, 2006.
Sang, E. F. T. K. and Buchholz, S. Introduction to the CoNLL-2000 shared task: Chunking, 2000.
Shalev-Shwartz, S. and Zhang, T. Proximal stochastic dual coordinate ascent. Technical Report 1211.2717v1 [stat.ML], arXiv, November 2012.
Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1), 2010a.
Shalev-Shwartz, S., Srebro, N., and Zhang, T. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20:2807-2832, 2010b.
Shamir, O. and Zhang, T. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML, 2013.
Taskar, B. Learning structured prediction models: A large margin approach. PhD thesis, Stanford, 2004.
Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. In NIPS, 2003.
Taskar, B., Lacoste-Julien, S., and Jordan, M. I. Structured prediction, dual extragradient and Bregman projections. JMLR, 7:1627-1653, 2006.
Teo, C. H., Vishwanathan, S. V. N., Smola, A. J., and Le, Q. V. Bundle methods for regularized risk minimization. JMLR, 11:311-365, 2010.
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. JMLR, 6:1453-1484, 2005.
Zhang, X., Saha, A., and Vishwanathan, S. V. N. Accelerated training of max-margin Markov networks with kernels. In ALT, pp. 292-307. Springer, 2011.

Supplementary Material: Block-Coordinate Frank-Wolfe Optimization for Structural SVMs

Outline. In Appendix A, we discuss the curvature constants and compute them for the structural SVM problem. In Appendix B, we give additional details on applying the Frank-Wolfe algorithms to the structural SVM and provide proofs for Theorems 1 and 3. In the main Appendix C, we give a self-contained presentation and analysis of the new block-coordinate Frank-Wolfe method (Algorithm 3), and prove the main convergence Theorem 2. In Appendix D, the 'linearization' duality gap is interpreted in terms of Fenchel duality. For completeness, we include a short derivation of the dual problem to the structural SVM in Appendix E. Finally, we present in Appendix F additional experimental results as well as more detailed information about the implementation.

A. The Curvature Constants C_f and C_f^⊗

The Curvature Constant C_f. The curvature constant C_f is given by the maximum relative deviation of the objective function f from its linear approximations, over the domain M (Clarkson, 2010; Jaggi, 2013). Formally,

  C_f := sup_{x,s ∈ M, γ ∈ [0,1], y = x + γ(s−x)}  (2/γ²) (f(y) − f(x) − ⟨y − x, ∇f(x)⟩).   (7)

The assumption of bounded C_f corresponds to a slightly weaker, affine invariant form of a smoothness assumption on f. It is known that C_f is upper bounded by the Lipschitz constant of the gradient ∇f times the squared diameter of M, for any arbitrary choice of a norm (Jaggi, 2013, Lemma 8); but it can also be much smaller (in particular, when the dimension of the affine hull of M is smaller than the ambient space), so it is a more fundamental quantity in the analysis of the Frank-Wolfe algorithm than the Lipschitz constant of the gradient. As pointed out by Jaggi (2013, Section 2.4), C_f is invariant under affine transformations, as is the Frank-Wolfe algorithm.

The Product Curvature Constant C_f^⊗. The curvature concept can be generalized to our setting of product domains M := M^(1) × … × M^(n) as follows: over each individual coordinate block, the curvature is given by

  C_f^(i) := sup_{x ∈ M, s_(i) ∈ M^(i), γ ∈ [0,1], y = x + γ(s_[i] − x_[i])}  (2/γ²) (f(y) − f(x) − ⟨y_(i) − x_(i), ∇_(i) f(x)⟩),   (8)

where the notation x_[i] refers to the zero-padding of x_(i) so that x_[i] ∈ M. By considering the Taylor expansion of f, it is not hard to see that the 'partial' curvature C_f^(i) is also upper bounded by the Lipschitz constant of the partial gradient ∇_(i) f times the squared diameter of just one domain block M^(i). See also the proof of Lemma A.2 below. We define the global product curvature constant as the sum of these curvatures for each block, i.e.
  C_f^⊗ := Σ_{i=1}^n C_f^(i).   (9)

Observe that for the classical Frank-Wolfe case when n = 1, we recover the original curvature constant.

… at α in problem (4) (in the maximization version). This difference is

  g_Lagrange(w, α) = (λ/2) wᵀw + (1/n) Σ_{i=1}^n H̃_i(w) − (bᵀα − (λ/2) wᵀw) = λ wᵀw − bᵀα + (1/n) Σ_{i=1}^n max_{y∈Y_i} H_i(y; w).

Now recall that, by the definition of A and b, we have (1/n) H_i(y; w) = (b − λAᵀw)_(i,y) = (−∇f(α))_(i,y). By summing up over all points and re-using a similar argument as in Lemma B.1 above, we get that

  (1/n) Σ_{i=1}^n max_{y∈Y_i} H_i(y; w) = Σ_{i=1}^n max_{y∈Y_i} (−∇f(α))_(i,y) = max_{s′∈M} ⟨s′, −∇f(α)⟩,

and hence

  g_Lagrange(w, α) = (λwᵀA − bᵀ)α + (1/n) Σ_{i=1}^n max_{y∈Y_i} H_i(y; w) = ⟨∇f(α), α⟩ + max_{s′∈M} ⟨−s′, ∇f(α)⟩ = ⟨α − s, ∇f(α)⟩ = g(α),

as defined in (5).

B.3. Convergence Analysis

B.3.1. Convergence of the Batch Frank-Wolfe Algorithm 2 on the Structural SVM Dual

Theorem′ 1. Algorithm 2 obtains an ε-approximate solution to the structural SVM dual problem (4), with duality gap g(α^(k)) ≤ ε, after at most O(R²/(λε)) iterations, where each iteration costs n oracle calls.

Proof. We apply the known convergence results for the standard Frank-Wolfe Algorithm 1, as given e.g. in (Frank & Wolfe, 1956; Dunn & Harshbarger, 1978; Jaggi, 2013), or as given in the paragraph just after the proof of Theorem C.1: for each k ≥ 1, the iterate α^(k) of Algorithm 1 (either using the predefined step-sizes, or using line-search) satisfies E[f(α^(k))] − f(α*) ≤ 2C_f/(k+2), where α* ∈ M is an optimal solution to problem (4). Furthermore, if Algorithm 1 is run for K ≥ 1 iterations, then it has an iterate α^(k̂), 1 ≤ k̂ ≤ K, with duality gap bounded by E[g(α^(k̂))] ≤ 6C_f/(K+1). This was shown e.g. in (Jaggi, 2013) with slightly different constants, and also follows from our analysis presented below (see the paragraph after the generalized analysis provided in Theorem C.3, when the number of blocks n is set to one). Now, for the SVM problem and the equivalent Algorithm 2, the claim follows from the curvature bound C_f ≤ 4R²/λ for the dual structural SVM objective function (4) over the domain M := Δ_{|Y_1|} × … × Δ_{|Y_n|}, as given in the above Lemma A.1.

B.3.2. Convergence of the Block-Coordinate Frank-Wolfe Algorithm 4 on the Structural SVM Dual

Theorem′ 3. If L_max ≤ 4R²/(λn) (so that h_0 ≤ 4R²/(λn)), then Algorithm 4 obtains an ε-approximate solution to the structural SVM dual problem (4), with expected duality gap E[g(α^(k))] ≤ ε, after at most O(R²/(λε)) iterations, where each iteration costs a single oracle call. If L_max > 4R²/(λn), then it requires at most an additional (constant in ε) number of O(n log(λn L_max / R²)) steps to get the same error and duality gap guarantees, whereas the predefined step-size variant will require an additional O(n L_max / ε) steps.

Proof. Writing h_0 = f(α^(0)) − f(α*) for the error at the starting point used by the algorithm, the convergence Theorem 2 states that if k ≥ 0 and k ≥ (2n/ε)(C_f^⊗ + h_0), then the expected error is E[f(α^(k))] − f(α*) ≤ ε and

… gap component g^(i)(α) as defined in (16); the total duality gap is then g(α) = Σ_i g^(i)(α), which can only be computed if we do a batch pass over all the datapoints, as explained in the previous paragraph.

B.5. More Details on the Kernelized Algorithm

Both Algorithms 2 and 4 can be used with kernels by explicitly maintaining the sparse dual variables α^(k) instead of the primal variables w^(k). In this case, the classifier is only given implicitly as a sparse combination of the corresponding kernel functions, i.e. w = Aα, where ψ_i(y) = k(x_i, y_i; ·, ·) − k(x_i, y; ·, ·) for a structured kernel k : (X×Y) × (X×Y) → R. Note that the number of non-zero dual variables is upper-bounded by the number of iterations, and so the time to take dot products grows quadratically in the number of iterations.
Algorithm B.1  Kernelized Dual Block-Coordinate Frank-Wolfe for Structural SVM

  Let α^(0) := (e_{y_1}, …, e_{y_n}) ∈ M = Δ_{|Y_1|} × … × Δ_{|Y_n|} and ᾱ^(0) := α^(0)
  for k = 0 … K do
    Pick i uniformly at random in {1, …, n}
    Solve y_i* := argmax_{y∈Y_i} H_i(y; Aα^(k))   (solve the loss-augmented decoding problem (2))
    s_(i) := e_{y_i*} ∈ M^(i)   (having only a single non-zero entry)
    Let γ := 2n/(k+2n), or optimize γ by line-search
    Update α_(i)^(k+1) := (1−γ) α_(i)^(k) + γ s_(i)
    (Optionally: Update ᾱ^(k+1) := (k/(k+2)) ᾱ^(k) + (2/(k+2)) α^(k+1))   (maintain a weighted average of the iterates)

To compute the line-search step-size, we simply re-use the same formula as in Algorithm 4, but reconstruct (implicitly) on the fly the missing quantities such as ℓ_i = bᵀα_[i], w_i = Aα_[i] and w^(k) = Aα^(k), and re-interpret dot products such as w_iᵀw^(k) as the suitable sum of kernel evaluations (which has O(k²/n) terms, where k is the number of iterations since the beginning).

C. Analysis of the Block-Coordinate Frank-Wolfe Algorithm 3

This section gives a self-contained presentation and analysis of the new block-coordinate Frank-Wolfe optimization Algorithm 3. The main goal is to prove the convergence Theorem 2, which here is split into two parts: the primal convergence rate in Theorem C.1, and the primal-dual convergence rate in Theorem C.3. Finally, we present a faster convergence result for the line-search variant in Theorem C.4 and Theorem C.5, which we have used in the convergence for the structural SVM case as presented above in Theorem 3.

Coordinate Descent Methods. Despite their simplicity and very early appearance in the literature, surprisingly few results were known on the convergence (and convergence rates in particular) of coordinate descent type methods. Recently, the interest in these methods has grown again due to their good scalability to very large scale problems, as e.g. in machine learning, and has also sparked new theoretical results such as (Nesterov, 2012).

Constrained Convex Optimization over Product Domains. We consider the general constrained convex optimization problem

  min_{x∈M} f(x)   (10)

over a Cartesian product domain M = M^(1) × … × M^(n) ⊆ R^m, where each factor M^(i) ⊆ R^{m_i} is convex and compact, and Σ_{i=1}^n m_i = m. We will write x_(i) ∈ R^{m_i} for the i-th block of coordinates of a vector x ∈ R^m, and x_[i] for the padding of x_(i) with zeros so that x_[i] ∈ R^m.

Nesterov's 'Huge Scale' Coordinate Descent. If the objective function f is strongly smooth (i.e. has Lipschitz continuous partial gradients ∇_(i) f(x) ∈ R^{m_i}), then the following algorithm converges⁵ at a rate of 1/k.

⁵ By additionally assuming strong convexity of f w.r.t. the ℓ_1-norm (global on M, not only on the individual factors), one can even get linear convergence rates; see again (Nesterov, 2012) and the follow-up paper (Richtárik & Takáč, 2011).

In the multiplicative case, we choose a fixed multiplicative error parameter 0 < ν ≤ 1 such that the candidate directions s_(i) attain the current 'duality gap' on the i-th factor up to a multiplicative approximation error of ν, i.e.
  ⟨x_(i) − s_(i), ∇_(i) f(x)⟩ ≥ ν · max_{s′_(i) ∈ M^(i)} ⟨x_(i) − s′_(i), ∇_(i) f(x)⟩.   (12)

If a multiplicative approximate internal oracle is used together with the predefined step-size instead of doing line-search, then the step-size in Algorithm C.2 needs to be increased to γ_k := 2n/(νk + 2n) instead of the original 2n/(k + 2n). Both types of errors can be combined with the following property for the candidate direction s_(i):

  ⟨x_(i) − s_(i), ∇_(i) f(x)⟩ ≥ ν · max_{s′_(i) ∈ M^(i)} ⟨x_(i) − s′_(i), ∇_(i) f(x)⟩ − (1/2) δ γ̃_k C_f^(i),   (13)

where γ̃_k := 2n/(νk + 2n).

Averaging the Iterates. In the above Algorithm C.2 we have also added an optional last line which maintains the following weighted average x_w^(k), defined for k ≥ 1 as

  x_w^(k) := (2/(k(k+1))) Σ_{t=1}^k t · x^(t),   (14)

and by convention we also define x_w^(0) := x^(0). As our convergence analysis will show, the weighted average of the iterates can yield more robust duality gap convergence guarantees when the duality gap function g is convex in x (see Theorem C.3); this is for example the case for quadratic functions such as the structural SVM objective (4). We will also consider in our proofs a scheme which averages the last (1−μ)-fraction of the iterates, for some fixed 0 ≤ μ < 1:

  x_μ^(k) := (1/(k − ⌈μk⌉ + 1)) Σ_{t=⌈μk⌉}^k x^(t).   (15)

This is what Rakhlin et al. (2012) call (1−μ)-suffix averaging, and it appeared in the context of obtaining a stochastic subgradient method with O(1/k) convergence rate for strongly convex functions, instead of the standard O((log k)/k) rate that one can prove for the individual iterates x^(k). The problem with (1−μ)-suffix averaging is that, to implement it for a fixed μ (say μ = 0.5) without storing a fraction of all the iterates, one needs to know in advance when the algorithm will be stopped. An alternative mentioned in Rakhlin et al. (2012) is to maintain a uniform average over rounds of exponentially increasing size (the so-called 'doubling trick'). This can give very good performance towards the end of the rounds, as we will see in our additional experiments in Appendix F, but the performance varies widely towards the beginning of the rounds. This motivates the simpler and more robust weighted averaging scheme (14), which, in the case of the stochastic subgradient method, was also recently proven to have an O(1/k) convergence rate by Lacoste-Julien et al. (2012)⁶ and independently by Shamir & Zhang (2013), who called such schemes 'polynomial-decay averaging'.

Related Work. In contrast to the randomized choice of coordinate which we use here, the analysis of cyclic coordinate descent algorithms (going through the blocks sequentially) seems to be notoriously difficult, such that until today, no analysis proving a global convergence rate has been obtained as far as we know. Luo & Tseng (1992) have proven a local linear convergence rate for the strongly convex case. For product domains, a cyclic analogue of our Algorithm C.2 has already been proposed in Patriksson (1998), using a generalization of Frank-Wolfe iterations under the name 'cost approximation'. The analysis of Patriksson (1998) shows asymptotic convergence, but since the method goes through the blocks sequentially, no convergence rates could be proven so far.

⁶ In that paper, they considered a (k+1)-weight instead of our k-weight, but similar rates can be proven for shifted versions. We motivate skipping the first iterate x^(0) in our weighted averaging scheme by the fact that bounds can sometimes be proven on the quality of x^(1) irrespective of x^(0) for Frank-Wolfe (see the paragraph after the proof of Theorem C.1 for example, looking at the n = 1 case).
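As a quick concreteness check on the averaging scheme (14): x_w^(k) can be maintained online with O(1) extra memory through the update x_w^(k+1) = (k/(k+2)) x_w^(k) + (2/(k+2)) x^(k+1), the same form that appears in the optional last line of Algorithm B.1 above. The small sketch below, with random vectors standing in for the iterates (an assumption purely for illustration), verifies that the incremental form reproduces the definition (14):

```python
import numpy as np

# Check that the incremental update reproduces the weighted average (14):
# x_w^(k) = 2/(k(k+1)) * sum_{t=1..k} t * x^(t).
rng = np.random.default_rng(0)
iterates = [rng.standard_normal(4) for _ in range(100)]   # x^(1), ..., x^(100)

x_w = iterates[0]                                         # x_w^(1) = x^(1)
for k, x_next in enumerate(iterates[1:], start=1):        # x_next is x^(k+1)
    x_w = (k / (k + 2)) * x_w + (2 / (k + 2)) * x_next

K = len(iterates)
direct = (2 / (K * (K + 1))) * sum(t * x for t, x in enumerate(iterates, start=1))
assert np.allclose(x_w, direct)
print("incremental weighted average matches (14)")
```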
C.1. Setup for Convergence Analysis

We review below the important concepts needed for analyzing the convergence of the block-coordinate Frank-Wolfe Algorithm C.2.

Decomposition of the Duality Gap. The product structure of our domain has a crucial effect on the duality gap, namely that it decomposes into a sum over the n components of the domain. The 'linearization' duality gap as defined in (5) (see also Jaggi (2013)) for any constrained convex problem of the above form (10), for a fixed feasible point x ∈ M, is given by

  g(x) := max_{s∈M} ⟨x − s, ∇f(x)⟩ = Σ_{i=1}^n max_{s_(i)∈M^(i)} ⟨x_(i) − s_(i), ∇_(i) f(x)⟩ =: Σ_{i=1}^n g^(i)(x).   (16)

Curvature. Also, the curvature can now be defined on the individual factors:

  C_f^(i) := sup_{x ∈ M, s_(i) ∈ M^(i), γ ∈ [0,1], y = x + γ(s_[i] − x_[i])}  (2/γ²) (f(y) − f(x) − ⟨y_(i) − x_(i), ∇_(i) f(x)⟩).   (17)

We recall that the notation x_[i] and x_(i) is defined just below (10). We define the global product curvature as the sum of these curvatures for each block, i.e.

  C_f^⊗ := Σ_{i=1}^n C_f^(i).   (18)

C.2. Primal Convergence on Product Domains

The following main theorem shows that after O(1/ε) many iterations, Algorithm C.2 obtains an ε-approximate solution.

Theorem C.1 (Primal Convergence). For each k ≥ 0, the iterate x^(k) of the exact variant of Algorithm C.2 satisfies

  E[f(x^(k))] − f(x*) ≤ (2n/(k + 2n)) (C_f^⊗ + f(x^(0)) − f(x*)).

For the approximate variant of Algorithm C.2 with additive approximation quality (11) for δ ≥ 0, it holds that

  E[f(x^(k))] − f(x*) ≤ (2n/(k + 2n)) (C_f^⊗ (1+δ) + f(x^(0)) − f(x*)).

For the approximate variant of Algorithm C.2 with multiplicative approximation quality (12) for 0 < ν ≤ 1, it holds that

  E[f(x^(k))] − f(x*) ≤ (2n/(νk + 2n)) (ν⁻¹ C_f^⊗ + f(x^(0)) − f(x*)).

All convergence bounds hold both if the predefined step-sizes or line-search is used in the algorithm. Here x* ∈ M is an optimal solution to problem (10), and the expectation is with respect to the random choice of blocks during the algorithm. (In other words, all three algorithm variants deliver a solution of (expected) primal error at most ε after O(1/ε) many iterations.)

The proof of the above theorem on the convergence rate of the primal error crucially depends on the following Lemma C.2 on the improvement in each iteration.
Proof. Consider first the exact variant, ν = 1. From the above Lemma C.2, we know that for every inner step of Algorithm C.2, and conditioned on x^(k), we have E[f(x^(k+1)_γ) | x^(k)] ≤ f(x^(k)) − (γν/n) g(x^(k)) + (γ²/(2n)) C_f^⊗, where the expectation is over the random choice of the block i (this bound holds independently of whether line-search is used or not). Writing h(x) := f(x) − f(x*) for the (unknown) primal error at any point x, this reads as

  E[h(x^(k+1)_γ) | x^(k)] ≤ h(x^(k)) − (γν/n) g(x^(k)) + (γ²/(2n)) C_f^⊗
                         ≤ h(x^(k)) − (γν/n) h(x^(k)) + (γ²/(2n)) C_f^⊗
                         = (1 − γν/n) h(x^(k)) + (γ²/(2n)) C_f^⊗,   (20)

where in the second line we have used weak duality h(x) ≤ g(x) (which follows directly from the definition of the duality gap, together with the convexity of f). The inequality (20) is conditioned on x^(k), which is a random quantity given the previous random choices of blocks to update. We get a deterministic inequality by taking the expectation of both sides with respect to the random choice of previous blocks, yielding

  E[h(x^(k+1)_γ)] ≤ (1 − γν/n) E[h(x^(k))] + (γ²/(2n)) C_f^⊗.   (21)

We observe that the resulting inequality (21) with ν = 1 is of the same form as the one appearing in the standard Frank-Wolfe primal convergence proof such as in Jaggi (2013), though with the crucial difference of the 1/n factor (and that we are now working with the expected values E[h(x^(k))] instead of the original h(x^(k))). We will thus follow a similar induction argument over k, but we will see that the 1/n factor yields a slightly different induction base case (which for n = 1 can be analyzed separately to obtain a better bound).

To simplify the notation, let h_k := E[h(x^(k))]. By induction, we are now going to prove that

  h_k ≤ 2nC/(νk + 2n)  for k ≥ 0,

for the choice of constant C := ν⁻¹ C_f^⊗ + h_0. The base case k = 0 follows immediately from the definition of C, given that C ≥ h_0. Now we consider the induction step for k ≥ 0. Here the bound (21), for the particular choice of step-size γ_k := 2n/(νk + 2n) ∈ [0,1] given by Algorithm C.2, gives us (the same bound also holds for the line-search variant, given that the corresponding objective value f(x^(k+1)_{Line-Search}) ≤ f(x^(k+1)_γ) only improves):

  h_{k+1} ≤ (1 − νγ_k/n) h_k + (γ_k)² νC/(2n)
          = (1 − 2ν/(νk + 2n)) h_k + (2n/(νk + 2n))² νC/(2n)
          ≤ (1 − 2ν/(νk + 2n)) (2nC/(νk + 2n)) + ν (1/(νk + 2n))² 2nC,

where in the first line we have used C_f^⊗ ≤ νC, and in the last inequality we have plugged in the induction hypothesis for h_k. Simply rearranging the terms gives

  h_{k+1} ≤ (2nC/(νk + 2n)) (1 − 2ν/(νk + 2n) + ν/(νk + 2n))
          = (2nC/(νk + 2n)) · ((νk + 2n − ν)/(νk + 2n))
          ≤ (2nC/(νk + 2n)) · ((νk + 2n)/(νk + 2n + ν))
          = 2nC/(ν(k+1) + 2n),

which is our claimed bound for k ≥ 0. The analogous claim for Algorithm C.2 using the approximate linear primitive with additive approximation quality (11), with γ̃_k = 2n/(k + 2n), follows from exactly the same argument by replacing every occurrence of C_f^⊗ in the proof by C_f^⊗ (1+δ) instead (compare to Lemma C.2 also; note that γ = γ̃_k here). Note moreover that one can easily combine a multiplicative approximation with an additive one as in (13), and modify the convergence statement accordingly.
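To see the analyzed algorithm in action, here is a minimal self-contained sketch of the exact variant of Algorithm C.2 with the predefined step-sizes, on a synthetic instance of problem (10): a quadratic objective over a product of n probability simplices, where the linear minimization oracle on each block simply returns the vertex with the smallest partial gradient. The instance itself (the names n, d, A, b and all sizes) is an illustrative assumption, not a problem from the paper.

```python
import numpy as np

# Exact block-coordinate Frank-Wolfe (Algorithm C.2) on a toy instance:
# minimize f(x) = 0.5 * ||A x - b||^2 over a product of n simplices in R^d.
rng = np.random.default_rng(0)
n, d = 20, 5
m = n * d
A = rng.standard_normal((m, m)) / np.sqrt(m)
b = rng.standard_normal(m)

def f(x):
    r = A @ x - b
    return 0.5 * (r @ r)

def grad(x):
    return A.T @ (A @ x - b)

x = np.zeros(m)
x[::d] = 1.0                        # start at a vertex of every simplex

for k in range(5000):
    i = rng.integers(n)             # pick one block uniformly at random
    blk = slice(i * d, (i + 1) * d)
    g_i = grad(x)[blk]
    s_i = np.zeros(d)               # block oracle: best simplex vertex
    s_i[np.argmin(g_i)] = 1.0
    gamma = 2 * n / (k + 2 * n)     # predefined step-size of Algorithm C.2
    x[blk] = (1 - gamma) * x[blk] + gamma * s_i

print("objective after 5000 cheap block steps:", f(x))
```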
Domains Without Product Structure: n = 1. Our above convergence result also holds for the case of the standard Frank-Wolfe algorithm, when no product structure on the domain is assumed, i.e. for the case n = 1. In this case, the constant in the convergence bound can even be improved for the variant of the algorithm without a multiplicative approximation (ν = 1), since the additive term given by h_0, the error at the starting point, can be removed. This is because already after the first step, we obtain a bound for h_1 which is independent of h_0. More precisely, plugging γ_0 := 1 and ν = 1 into the bound (21) when n = 1 gives h_1 ≤ 0 + (1/2) C_f^⊗ (1+δ) ≤ C. Using k = 1 as the base case for the same induction proof as above, we obtain that for n = 1:

  h_k ≤ (2/(k+2)) C_f^⊗ (1+δ)  for all k ≥ 1,

which matches the convergence rate given in Jaggi (2013). Note that in the traditional Frank-Wolfe setting, i.e. n = 1, our defined curvature constant becomes C_f^⊗ = C_f.

Dependence on h_0. We note that the only use of including h_0 in the constant C = ν⁻¹ C_f^⊗ + h_0 was to satisfy the base case of the induction proof, at k = 0. If from the structure of the problem we can get a guarantee that h_0 ≤ ν⁻¹ C_f^⊗, then the smaller constant C′ = ν⁻¹ C_f^⊗ will satisfy the base case, and the whole proof goes through with it, without needing the extra h_0 factor. See also Theorem C.4 for a better convergence result with a weaker dependence on h_0 in the case where line-search is used.

C.3. Obtaining a Small Duality Gap

The following theorem shows that after O(1/ε) many iterations, Algorithm C.2 will have visited a solution with ε-small duality gap in expectation. Because the block-coordinate Frank-Wolfe algorithm only looks at one block at a time, it does not know its current true duality gap without doing a full (batch) pass over all blocks. Without monitoring this quantity, the algorithm could miss which iterate had a low duality gap. This is why, if one is interested in having a good duality gap (as in the structural SVM application), the averaging schemes considered in (14) and (15) become interesting: the following theorem also says that the bound holds for each of the averaged iterates, provided the duality gap function g is convex, which is the case for example when f is a quadratic function.⁷

⁷ To see that g is convex when f is quadratic, we refer to the equivalence between the gap g(x) and the Fenchel duality gap p(x) − d(∇f(x)) as shown in Appendix D. The dual function d(·) is concave, so if ∇f(x) is an affine function of x (which is the case for a quadratic function), then d(∇f(x)) is a concave function of x, implying that g(x) = p(x) − d(∇f(x)) is convex in x, since the primal function p is convex.

Theorem C.3 (Primal-Dual Convergence). For each K ≥ 0, the variants of Algorithm C.2 (either using the predefined step-sizes, or using line-search) will yield at least one iterate x^(k̂) with k̂ ≤ K with expected duality gap bounded by

  E[g(x^(k̂))] ≤ β · (2n/(ν(K+1))) · C,

where β = 3 and C = ν⁻¹ C_f^⊗ (1+δ) + f(x^(0)) − f(x*). Here δ ≥ 0 and 0 < ν ≤ 1 are the approximation quality parameters as defined in (13); use δ = 0 and ν = 1 for the exact variant. Moreover, if the duality gap g is a convex function of x, then the above bound also holds both for E[g(x_w^(K))] and E[g(x_{0.5}^(K))] for each K ≥ 0, where x_w^(K) is the weighted average of the iterates as defined in (14) and x_{0.5}^(K) is the 0.5-suffix average of the iterates as defined in (15) with μ = 0.5.

Proof. To simplify notation, we will again denote the expected primal error and expected duality gap at any iteration k ≥ 0 of the algorithm by h_k := E[h(x^(k))] := E[f(x^(k))] − f(x*) and g_k := E[g(x^(k))] respectively. The proof starts again by using the crucial improvement Lemma C.2 with γ = γ_k := 2n/(νk + 2n) to cover both variants of Algorithm C.2 at the same time. As in the beginning of the proof of Theorem C.1, we take the expectation with respect to x^(k) in Lemma C.2 and subtract f(x*) to get, for each k ≥ 0 (for the general approximate variant of the algorithm):

  h_{k+1} ≤ h_k − (ν/n) γ_k g_k + (1/(2n)) (γ_k² + δ γ̃_k γ_k) C_f^⊗ ≤ h_k − (ν/n) γ_k g_k + (1/(2n)) γ_k² C_f^⊗ (1+δ),
since γ̃_k ≤ γ_k. By isolating g_k and using the fact that C ≥ ν⁻¹ C_f^⊗ (1+δ), we get the crucial inequality for the expected duality gap:

  g_k ≤ (n/(ν γ_k)) (h_k − h_{k+1}) + (γ_k/2) C.   (22)

The general proof idea to get a handle on g_k is to take a convex combination of the inequality (22) over multiple k's, to obtain a new upper bound. Because a convex combination of numbers is upper bounded by its maximum, we know that the new bound has to upper bound at least one of the g_k's (this gives the existence of k̂ in the theorem). Moreover, if g is convex, we can also obtain an upper bound for the expected duality gap of the same convex combination of the iterates. So let {w_k}_{k=0}^K be a set of non-negative weights, and let ρ_k := w_k / S_K, where S_K := Σ_{k=0}^K w_k. Taking the convex combination of inequality (22) with coefficients ρ_k, we get

  Σ_{k=0}^K ρ_k g_k ≤ (n/ν) Σ_{k=0}^K ρ_k (h_k/γ_k − h_{k+1}/γ_k) + Σ_{k=0}^K ρ_k γ_k C/2
    = (n/ν) (ρ_0 h_0/γ_0 − ρ_K h_{K+1}/γ_K) + (n/ν) Σ_{k=0}^{K−1} h_{k+1} (ρ_{k+1}/γ_{k+1} − ρ_k/γ_k) + Σ_{k=0}^K ρ_k γ_k C/2
    ≤ (n/ν) ρ_0 h_0/γ_0 + (n/ν) Σ_{k=0}^{K−1} h_{k+1} (ρ_{k+1}/γ_{k+1} − ρ_k/γ_k) + Σ_{k=0}^K ρ_k γ_k C/2,   (23)

using h_{K+1} ≥ 0. Inequality (23) can be seen as a master inequality from which to derive various bounds on g_k. In particular, if we define x̄ := Σ_{k=0}^K ρ_k x^(k) and we suppose that g is convex (which is the case for example when f is a quadratic function), then we have E[g(x̄)] ≤ Σ_{k=0}^K ρ_k g_k by convexity and linearity of the expectation.

Weighted-averaging case. We first consider the weights w_k = k which appear in the definition of the weighted average of the iterates x_w^(K) in (14), and suppose K ≥ 1. In this case, we have ρ_k = k/S_K, where S_K = K(K+1)/2. With the predefined step-size γ_k = 2n/(νk + 2n), we then have

  ρ_{k+1}/γ_{k+1} − ρ_k/γ_k = (1/(2n S_K)) ((k+1)(ν(k+1) + 2n) − k(νk + 2n)) = (ν(2k+1) + 2n)/(2n S_K).

Plugging this into the master inequality (23), as well as using the convergence rate h_k ≤ 2nC/(νk + 2n) from Theorem C.1, we obtain

  Σ_{k=0}^K ρ_k g_k ≤ (n/(ν S_K)) (0 + Σ_{k=0}^{K−1} (2nC/(ν(k+1) + 2n)) · ((ν(2k+1) + 2n)/(2n))) + Σ_{k=0}^K (2nk/(νk + 2n)) · C/(2 S_K)
    ≤ (nC/(ν S_K)) (2 Σ_{k=0}^{K−1} 1 + Σ_{k=1}^K 1) = (2nC/(ν(K+1))) · 3.

Hence we have proven the bound with β = 3 for K ≥ 1. For K = 0, the master inequality (23) becomes g_0 ≤ (n/ν) h_0 + (1/2) C ≤ (nC/ν)(1 + 1/(2n)), since h_0 ≤ C and ν ≤ 1. Given that n ≥ 1, we see that the bound also holds for K = 0.

Suffix-averaging case. For the proof of convergence of the 0.5-suffix averaging of the iterates x_{0.5}^(K), we refer the reader to the proof of Theorem C.5, which can be re-used for this case (see the last paragraph of that proof for how).
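Since the algorithm only looks at one block at a time, evaluating its true duality gap requires the full batch pass discussed above. The following sketch spells out that pass for the same kind of product-of-simplices instance (again an illustrative assumption), computing the decomposition g(x) = Σ_i g^(i)(x) of (16); on a simplex, each block maximum is attained at a vertex:

```python
import numpy as np

# Batch computation of the 'linearization' duality gap (16):
# g(x) = sum_i max_{s_(i)} <x_(i) - s_(i), grad_(i) f(x)>, where on a simplex
# the block maximum equals <x_(i), grad_(i)> - min_j grad_(i),j.
rng = np.random.default_rng(1)
n, d = 20, 5
m = n * d
A = rng.standard_normal((m, m)) / np.sqrt(m)
b = rng.standard_normal(m)

def grad(x):
    return A.T @ (A @ x - b)

def duality_gap(x):
    g = grad(x)
    total = 0.0
    for i in range(n):                       # one full pass over the blocks
        blk = slice(i * d, (i + 1) * d)
        total += x[blk] @ g[blk] - g[blk].min()
    return total

x = np.zeros(m)
x[::d] = 1.0
print("gap at the starting point:", duality_gap(x))   # usable as a stopping criterion
```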
Domains Without Product Structure: n = 1. As we mentioned after the proof of the primal convergence Theorem C.1, if n = 1 then we can replace C in the statement of Theorem C.3 by C_f^⊗ (1+δ) for K ≥ 1 when ν = 1, as then we can ensure that h_1 ≤ C, which is all that was needed for the primal convergence induction. Again, C_f^⊗ = C_f when n = 1.

C.4. An Improved Convergence Analysis for the Line-Search Case

C.4.1. Improved Primal Convergence for Line-Search

If line-search is used, we can improve the convergence results of Theorem C.1 by showing a weaker dependence on the starting condition h_0, thanks to faster progress in the first few iterations:

Theorem C.4 (Improved Primal Convergence for Line-Search). For each k ≥ k_0, the iterate x^(k) of the line-search variant of Algorithm C.2 (where the linear subproblem is solved with a multiplicative approximation quality (12) of 0 < ν ≤ 1) satisfies

  E[f(x^(k))] − f(x*) ≤ (2n ν⁻¹ C_f^⊗) / (ν(k − k_0) + 2n),   (24)

where k_0 := max(0, ⌈log(2ν h(x^(0)) / C_f^⊗) / (−log ρ_n)⌉) is the number of steps required to guarantee that E[f(x^(k))] − f(x*) ≤ ν⁻¹ C_f^⊗. Here x* ∈ M is an optimal solution to problem (10), h(x^(0)) := f(x^(0)) − f(x*) is the primal error at the starting point, and ρ_n := 1 − ν/n ≤ 1 is the geometric decrease rate of the primal error in the first phase while k ≤ k_0, i.e. E[f(x^(k))] − f(x*) ≤ (ρ_n)^k h(x^(0)) + C_f^⊗/(2ν) for k ≤ k_0. If the linear subproblem is solved with an additive approximation quality (11) of δ ≥ 0 instead, then replace all appearances of C_f^⊗ above by C_f^⊗ (1+δ).

Proof. For the line-search case, the expected improvement guaranteed by Lemma C.2 for the multiplicative approximation variant of Algorithm C.2, in expectation as in (21), is valid for any choice of γ ∈ [0,1]:

  E[h(x^(k+1)_{LS})] ≤ (1 − γν/n) E[h(x^(k))] + (γ²/(2n)) C_f^⊗.   (25)

Because the bound (25) holds for any γ, we are free to choose the one which minimizes it subject to γ ∈ [0,1], that is γ* := min(1, ν h_k / C_f^⊗), where we have again used the identification h_k := E[h(x^(k)_{LS})]. Now we distinguish two cases.

If γ* = 1, then h_k ≥ ν⁻¹ C_f^⊗. By unrolling the inequality (25) recursively to the beginning and using γ = 1 at each step, we get

  h_{k+1} ≤ (1 − ν/n) h_k + (1/(2n)) C_f^⊗
          ≤ (1 − ν/n)^{k+1} h_0 + (1/(2n)) C_f^⊗ Σ_{t=0}^k (1 − ν/n)^t
          ≤ (1 − ν/n)^{k+1} h_0 + (1/(2n)) C_f^⊗ Σ_{t=0}^∞ (1 − ν/n)^t
          = (1 − ν/n)^{k+1} h_0 + (1/(2n)) C_f^⊗ · (1/(1 − (1 − ν/n)))
          = (1 − ν/n)^{k+1} h_0 + (1/(2ν)) C_f^⊗.

We thus have a geometric decrease with rate ρ_n := 1 − ν/n in this phase. We then get h_k ≤ ν⁻¹ C_f^⊗ as soon as (ρ_n)^k h_0 ≤ C_f^⊗/(2ν), i.e. when k ≥ log_{1/ρ_n}(2ν h_0 / C_f^⊗) = log(2ν h_0 / C_f^⊗) / (−log(1 − ν/n)). We have thus obtained a logarithmic bound on the number of steps that fall into this first regime, where h_k is still 'large'. It is crucial to note here that the primal error h_k is always decreasing in each step, due to the line-search, so once we leave the regime h_k ≥ ν⁻¹ C_f^⊗, we never enter it again in subsequent steps.

On the other hand, as soon as we reach a step k (e.g. when k = k_0) such that γ* ≤ 1, or equivalently h_k ≤ ν⁻¹ C_f^⊗, we are always in the second phase where γ* = ν h_k / C_f^⊗. Plugging this value of γ* into (25) yields the recurrence

  h_{k+1} ≤ h_k − (ν²/(2n C_f^⊗)) h_k²  for all k ≥ k_0.   (26)

Theorem C.5 (Improved Primal-Dual Convergence for Line-Search). Let k_0 be defined as in Theorem C.4. For each K ≥ 5 k_0, the line-search variant of Algorithm C.2 will yield at least one iterate x^(k̂) with k̂ ≤ K with expected duality gap bounded by

  E[g(x^(k̂))] ≤ β · (2n/(ν(K+2))) · C,

where β = 3 and C = ν⁻¹ C_f^⊗ (1+δ). Here δ ≥ 0 and 0 < ν ≤ 1 are the approximation parameters as defined in (13); use δ = 0 and ν = 1 for the exact variant. Moreover, if the duality gap g is a convex function of x, then the above bound also holds for E[g(x_{0.5}^(K))] for each K ≥ 5 k_0, where x_{0.5}^(K) is the 0.5-suffix average of the iterates as defined in (15) with μ = 0.5.

Proof. We follow a similar argument as in the proof of Theorem C.3, but making use of the better primal convergence Theorem C.4, as well as using the 0.5-suffix average in the master inequality (23). Let K ≥ 5 k_0 be given. Let γ_k := 2n/(ν(k − k_0) + 2n) for k ≥ k_0. Note then that γ̃_k = 2n/(νk + 2n) ≤ γ_k, and so the gap inequality (22) appearing in the proof of Theorem C.3 is valid for this γ_k (because we are considering the line-search variant of Algorithm C.2, we are free to choose any γ ∈ [0,1] in Lemma C.2). This means that the master inequality (23) is also valid here with C = ν⁻¹ C_f^⊗ (1+δ).
We consider the weights which appear in the definition of the 0.5-suffix average of the iterates x_{0.5}^(K) given in (15), i.e. the average of the iterates x^(k) from k = K_s := ⌈0.5 K⌉ to k = K. We thus have ρ_k = 1/S_K for K_s ≤ k ≤ K and ρ_k = 0 otherwise, where S_K = K − ⌈0.5 K⌉ + 1. Notice that K_s ≥ k_0 by assumption. With these choices of ρ_k and γ_k, the master inequality (23) becomes

  Σ_{k=0}^K ρ_k g_k ≤ (n/(ν S_K)) (h_{K_s}/γ_{K_s} + Σ_{k=K_s}^{K−1} h_{k+1} (1/γ_{k+1} − 1/γ_k)) + Σ_{k=K_s}^K γ_k C/(2 S_K)
    ≤ (n/(ν S_K)) (C + Σ_{k=K_s}^{K−1} (2nC/(ν(k+1−k_0) + 2n)) · (ν/(2n))) + Σ_{k=K_s}^K (2n/(ν(k−k_0) + 2n)) · C/(2 S_K)
    = (nC/(ν S_K)) (1 + Σ_{k=K_s}^{K−1} 1/(k+1−k_0+2n/ν) + Σ_{k=K_s}^K 1/(k−k_0+2n/ν))
    ≤ (nC/(ν S_K)) (1 + 2 Σ_{k=K_s}^K 1/(k−k_0+2n/ν))
    ≤ (2nC/(ν(K+2))) (1 + 2 Σ_{k=K_s}^K 1/(k−k_0+2n/ν)),   (28)

where in the second line we used the faster convergence rate h_k ≤ 2nC/(ν(k−k_0) + 2n) from Theorem C.4, given that K_s ≥ k_0, and in the last line we used S_K ≥ 0.5 K + 1. The rest of the proof simply amounts to obtaining an upper bound of β = 3 on the bracketed term in (28), thus concluding that Σ_{k=0}^K ρ_k g_k ≤ β · 2nC/(ν(K+2)). Following a similar argument as in Theorem C.3, this implies that there exists some g_k̂ similarly upper bounded (the existence part of the theorem), and, if g is convex, that E[g(x_{0.5}^(K))] is also similarly upper bounded.

We can upper bound the summand term in (28) by using the fact that, for any non-negative decreasing integrable function f, we have Σ_{k=K_s}^K f(k) ≤ ∫_{K_s−1}^K f(t) dt. Let a_n := k_0 − 2n/ν. Using f(k) := 1/(k − a_n), we have

  Σ_{k=K_s}^K 1/(k − a_n) ≤ ∫_{K_s−1}^K dt/(t − a_n) = [log(t − a_n)]_{t=K_s−1}^{t=K} = log((K − a_n)/(K_s − 1 − a_n)) ≤ log((K − a_n)/(0.5 K − 1 − a_n)) =: b(K).

Plugging this condition and the expression (31) for w back into the Lagrangian, we obtain the Lagrange dual problem

  max_α  −(λ/2) ‖ Σ_{i∈[n], y∈Y_i} α_i(y) ψ_i(y)/(λn) ‖²  +  Σ_{i∈[n], y∈Y_i} α_i(y) L_i(y)/n
  s.t.   Σ_{y∈Y_i} α_i(y) = 1  ∀ i ∈ [n],   α_i(y) ≥ 0  ∀ i ∈ [n], ∀ y ∈ Y_i,

which is exactly the negative of the quadratic program claimed in (4).

F. Additional Experiments

Complementing the results presented in Figure 1 in Section 6 of the main paper, we provide here additional experimental results, as well as more information about the experimental setup used.

For the Frank-Wolfe methods, Figure 2 presents results on OCR comparing setting the step-size by line-search against the simpler predefined step-size scheme γ_k = 2n/(k + 2n). There, BCFW with predefined step-sizes performs similarly to SSG, indicating that most of the improvement of BCFW with line-search over SSG comes from the optimal step-size choice (and not from the Frank-Wolfe formulation on the dual). We also see that BCFW with predefined step-sizes can even do worse than batch Frank-Wolfe with line-search in the early iterations for small values of λ.

Figures 3 and 4 show additional results of the stochastic solvers for several values of λ on the OCR and CoNLL datasets. Here we also include the (uniformly) averaged stochastic subgradient method (SSG-avg), which starts averaging from the beginning, as well as the 0.5-suffix averaging versions of both SSG and BCFW (SSG-tavg and BCFW-tavg respectively), implemented using the 'doubling trick' as described just after Equation (15) in Appendix C. The 'doubling trick' uniformly averages all iterates since the last iteration which was a power of 2; it was described by Rakhlin et al. (2012), with experiments for SSG in Lacoste-Julien et al. (2012). In our experiments, BCFW-tavg sometimes slightly outperforms the weighted average scheme BCFW-wavg, but its performance fluctuates more widely, which is why we recommend BCFW-wavg, as mentioned in the main text. In our experiments, the objective value of SSG-avg is always worse than that of the other stochastic methods (apart from online-EG), which is why it was excluded from the main text. Online-EG performed substantially worse than the other stochastic solvers on the OCR dataset, and is therefore not included in the comparison for the other datasets.⁹

Finally, Figure 5 presents additional results for the matching application from Taskar et al. (2006).
⁹ The worse performance of the online exponentiated gradient method could be explained by the fact that it uses a log-parameterization of the dual variables, so its iterates are forced to lie in the interior of the probability simplex, whereas we know that the optimal solution for the structural SVM objective lies at the boundary of the domain, and thus these parameters need to go to infinity.
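For concreteness, here is a hypothetical sketch of the 'doubling trick' implementation of the tavg variants above: a uniform average that restarts at every power-of-2 iteration, so that at the end of each round it approximates the 0.5-suffix average of (15) with μ = 0.5; the random vectors are stand-ins for the actual SSG or BCFW iterates.

```python
import numpy as np

# 'Doubling trick' suffix averaging: restart a uniform running average at
# k = 1, 2, 4, 8, ...; at the end of a round it covers roughly the last half
# of the iterates, approximating the 0.5-suffix average x_{0.5}^(k).
rng = np.random.default_rng(2)
dim = 3
avg, count = np.zeros(dim), 0

for k in range(1, 65):
    x_k = rng.standard_normal(dim)       # stand-in for the k-th iterate
    if k & (k - 1) == 0:                 # k is a power of two: start a new round
        avg, count = np.zeros(dim), 0
    count += 1
    avg += (x_k - avg) / count           # uniform average within the round
```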