on- vs. off- policy
- To: Rao Kambhampati <rao@asu.edu>
- Subject: on- vs. off- policy
- From: Subbarao Kambhampati <rao@asu.edu>
- Date: Mon, 8 Nov 2010 15:24:41 -0700
- Sender: subbarao2z2@gmail.com
I used the terms on-policy and off-policy incorrectly in class.
An on-policy agent actually follows a policy and computes the values of that same policy. An off-policy agent gets its experience through the policy being followed, but computes the values with the Bellman optimality equations, i.e., it backs up the max over next actions (and thus learns the optimal values even when being guided by an inferior policy).
Thus, SARSA is on-policy, while Q-learning, which backs up the best Q-value of the next state, is off-policy.
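To make the distinction concrete, here is a minimal sketch of the two update rules in Python. The table layout (a dict keyed by (state, action)), the function names, the state and action labels, and the step size and discount values are all assumptions chosen for illustration; they are not taken from the slides.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: back up the Q-value of the action actually taken next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy: back up the best Q-value in the next state,
    regardless of which action the agent actually takes there."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def make_table():
    # Hypothetical starting estimates for the next state "s1".
    Q = defaultdict(float)
    Q[("s1", "left")] = 1.0
    Q[("s1", "right")] = 5.0
    return Q

ACTIONS = ["left", "right"]

# Same experience for both learners: in s0 the agent took "go", got
# reward 0, landed in s1, and then (exploring) chose the inferior "left".
q_sarsa = make_table()
sarsa_update(q_sarsa, "s0", "go", 0.0, "s1", "left", alpha=0.5, gamma=1.0)

q_qlearn = make_table()
q_learning_update(q_qlearn, "s0", "go", 0.0, "s1", ACTIONS, alpha=0.5, gamma=1.0)

print(q_sarsa[("s0", "go")])   # 0.5: reflects the policy being followed
print(q_qlearn[("s0", "go")])  # 2.5: reflects the best next action
```

Both learners see exactly the same transition; the only difference is which next-state value goes into the target. SARSA's estimate is pulled toward the value of the exploratory action it actually took, while Q-learning's is pulled toward the value of the best action available.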
I modified the corresponding slide to make this clear.
Sorry for the confusion.
rao