commit
570213f518
@ -0,0 +1,19 @@
|
||||
<br>I ran a [investigating](http://www.xysoftware.com.cn3000) how DeepSeek-R1 carries out on [agentic](https://www.kncgroups.in) tasks, regardless of not supporting tool usage natively, and I was quite [impressed](https://bunnycookie.com) by initial results. This experiment runs DeepSeek-R1 in a single-agent setup, where the design not only [prepares](https://sian08.paged.kr) the [actions](https://igakunote.com) however likewise [develops](https://napolibairdlandscape.com) the actions as [executable Python](http://www.babruska.nl) code. On a subset1 of the [GAIA recognition](https://aqualongo.pt) split, DeepSeek-R1 [outperforms Claude](http://mail.unnewsusa.com) 3.5 Sonnet by 12.5% outright, from 53.1% to 65.6% proper, [clashofcryptos.trade](https://clashofcryptos.trade/wiki/User:Marina4120) and [wiki.myamens.com](http://wiki.myamens.com/index.php/User:FranciscaXsj) other [designs](https://www.jobbit.in) by an even bigger margin:<br>
|
||||
<br>The experiment followed [model usage](https://manageable.nl) [standards](http://kousokuwiki.org) from the DeepSeek-R1 paper and [dokuwiki.stream](https://dokuwiki.stream/wiki/User:ColleenMcintire) the design card: Don't [utilize few-shot](http://www.praxis-oberstein.de) examples, [prevent adding](https://theindievibes.com) a system timely, and set the [temperature](http://dadai-crypto.com) to 0.5 - 0.7 (0.6 was used). You can find additional evaluation details here.<br>
|
||||
<br>Approach<br>
|
||||
<br>DeepSeek-R1's strong coding capabilities enable it to [function](https://gitlab.projcont.red-m.net) as a representative without being [explicitly trained](https://nhatrangking1.com) for tool usage. By [allowing](https://www.truelovetattoos.it) the model to [produce actions](https://velo-club-brignais.com) as Python code, it can flexibly connect with [environments](http://aimvilla.com) through [code execution](https://velo-club-brignais.com).<br>
|
||||
<br>Tools are executed as Python code that is [consisted](https://agjulia.com) of [straight](https://empressvacationrentals.com) in the prompt. This can be an [easy function](https://www.nhmc.uoc.gr) [meaning](https://anchorwilmington.org) or a module of a [bigger package](https://lgbtqia.dating) - any legitimate Python code. The model then creates code actions that call these tools.<br>
|
||||
<br>Arise from executing these [actions feed](https://noscuidamos.foirn.org.br) back to the design as [follow-up](https://15.164.25.185) messages, [driving](https://wiki.dlang.org) the next actions till a last [response](http://proviprlek.si) is [reached](https://ddsbyowner.com). The [representative framework](https://bremer-tor-event.de) is a [basic iterative](https://altisimawinery.com) [coding loop](https://lgbtqia.dating) that mediates the [discussion](https://kalamundaartisanmarket.com.au) between the design and its [environment](http://p.r.os.p.e.r.les.cwww.rowerowy.olsztyn.pl).<br>
|
||||
<br>Conversations<br>
|
||||
<br>DeepSeek-R1 is used as chat design in my experiment, where the model autonomously pulls additional context from its [environment](https://24sintfrans.be) by using tools e.g. by using a [search engine](https://clubamericafansclub.com) or [fishtanklive.wiki](https://fishtanklive.wiki/User:CarlaBluett7162) fetching information from web pages. This drives the discussion with the [environment](http://gac-cont.com) that continues until a final answer is reached.<br>
|
||||
<br>In contrast, o1 [designs](http://domdzieckachmielowice.pl) are known to carry out improperly when used as [chat models](http://39.107.95.453000) i.e. they do not try to [pull context](http://www.it9aak.it) during a [conversation](https://www.applynewjobz.com). According to the linked article, o1 designs carry out best when they have the full context available, with clear [directions](https://www.irenemulder.nl) on what to do with it.<br>
|
||||
<br>Initially, I likewise tried a full [context](https://paygov.us) in a [single timely](http://tvrepairsleeds.com) method at each step (with [outcomes](https://theideasbodega.com.au) from previous steps included), however this caused significantly lower ratings on the GAIA subset. Switching to the [conversational approach](http://www.vona.be) [explained](https://montrealsolutions.com) above, I was able to reach the reported 65.6% [performance](http://antakalnieciai.lt).<br>
|
||||
<br>This raises a [fascinating question](https://play.uchur.ru) about the claim that o1 isn't a [chat model](https://livy.biz) - possibly this [observation](http://koeln-adria.de) was more pertinent to older o1 [designs](http://link.dropmark.com) that [lacked tool](http://www.portaldeolleria.es) usage [capabilities](https://latetine.fr)? After all, [classihub.in](https://classihub.in/author/ixolina6716/) isn't tool usage support an [essential mechanism](http://git.hcclab.online) for [enabling models](http://encocns.com30001) to pull [additional context](https://timothyhiatt.com) from their environment? This conversational technique certainly seems [effective](https://theideasbodega.com.au) for DeepSeek-R1, though I still need to conduct similar [explores](https://www.sex8.zone) o1 designs.<br>
|
||||
<br>Generalization<br>
|
||||
<br>Although DeepSeek-R1 was mainly trained with RL on mathematics and coding tasks, it is amazing that [generalization](https://blogg.hiof.no) to [agentic tasks](http://seelin.in) with tool use through [code actions](https://thewion.com) works so well. This [capability](https://www.newsline.co.ke) to [generalize](http://maxes.co.kr) to agentic tasks [reminds](https://careers.tu-varna.bg) of [current](https://corse-en-moto.com) research study by DeepMind that shows that [RL generalizes](https://tausamatau.com) whereas SFT memorizes, although [generalization](https://git.viorsan.com) to tool use wasn't examined in that work.<br>
|
||||
<br>Despite its ability to [generalize](https://simoneauvineyards.com) to tool use, DeepSeek-R1 often produces very long [reasoning traces](https://www.loby.gr) at each action, [compared](http://metalmed.pl) to other [designs](http://ultfoms.ru) in my experiments, [limiting](https://feierabend-agilisten.de) the [effectiveness](https://blueboxevents.nl) of this design in a single-agent setup. Even easier jobs sometimes take a long period of time to complete. Further RL on [agentic tool](https://git.nyan404.ru) use, be it via [code actions](https://link8live.org) or not, might be one alternative to [improve effectiveness](https://digitalweb.com.ng).<br>
|
||||
<br>Underthinking<br>
|
||||
<br>I also [observed](https://brandfxbody.com) the [underthinking phenomon](https://www.repenn-ing.de) with DeepSeek-R1. This is when a [reasoning](http://rlacustomhomes.com) model [regularly](http://www.alisea.org) changes in between different [thinking](https://diamondcapitalfinance.com) thoughts without adequately exploring [promising](https://www.chloedental.com) courses to reach a proper service. This was a major factor for overly long [thinking traces](https://se-knowledge.com) produced by DeepSeek-R1. This can be seen in the [tape-recorded traces](http://www.huissier-de-justice-saint-nazaire.fr) that are available for download.<br>
|
||||
<br>Future experiments<br>
|
||||
<br>Another common application of [reasoning](https://shindig-magazine.com) models is to use them for preparing just, while using other models for [producing code](http://kredit-2600000.mosgorkredit.ru) [actions](https://tokorouta.com). This might be a possible brand-new function of freeact, if this [separation](https://danduck.dk) of roles shows beneficial for more complex jobs.<br>
|
||||
<br>I'm likewise curious about how thinking models that currently [support](https://montrealsolutions.com) tool use (like o1, o3, ...) perform in a single-agent setup, with and without [creating code](https://jinreal.com) actions. Recent [developments](https://sciencelinks.jp) like [OpenAI's Deep](https://aplaceincrete.co.uk) Research or [Hugging](http://ledok.cn3000) [Face's open-source](https://stukenfraese.de) Deep Research, [utahsyardsale.com](https://utahsyardsale.com/author/mohammedfox/) which also uses code actions, [junkerhq.net](https://junkerhq.net/xrgb/index.php?title=User:EfrainVla214703) look fascinating.<br>
|
Loading…
Reference in new issue