


Dynamic programming

Dynamic programming (DP) is a technique for solving complex problems. In DP, instead of attacking a complex problem all at once, we break it into simple subproblems, then compute and store the solution to each subproblem. If the same subproblem comes up again, we do not recompute it; we reuse the solution that has already been stored.
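As a small illustration of this idea of storing and reusing subproblem solutions, here is a sketch in Python. Fibonacci numbers are used only as an example and are separate from the reinforcement-learning program later in this section.

from functools import lru_cache

@lru_cache(maxsize=None)        # store every subproblem's solution
def fib(n):
    if n < 2:
        return n
    # reuse the cached answers instead of recomputing them
    return fib(n - 1) + fib(n - 2)

print(fib(50))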

We solve the Bellman equation (written out below in standard notation) with two powerful algorithms:

Value iteration

Policy iteration
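In standard notation (not quoted from this text), the Bellman optimality equation both algorithms aim to solve is

V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma \, V^{*}(s') \bigr]

where P(s' | s, a) is the transition probability, R(s, a, s') the reward, and gamma the discount factor.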

Value iteration



We will learn this with the help of diagrams and programs.

In value iteration we start with a random value function. Since a randomly initialized value table is not optimal, we optimize it iteratively.
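Concretely, each sweep of value iteration applies the Bellman backup to every state, in the same notation as above, until the table stops changing:

V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma \, V_{k}(s') \bigr]

This is exactly the update that the code below performs for the FrozenLake environment.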



Let's start programming; for this we will use OpenAI Gym and NumPy.

import gym

import numpy as np

# create the environment
env = gym.make('FrozenLake-v0')

# FrozenLake is a discrete environment, so it has a finite number of states
states = env.observation_space.n   # number of states (valid for discrete spaces)

# number of actions available in each state
actions = env.action_space.n

# initialize the value table with zeros
value_table = np.zeros(states)


def value_iterations(env, n_iterations, gamma=1.0, threshold=1e-30):
    # repeatedly apply the Bellman backup until the value table stops changing
    # (note: this updates the module-level value_table in place)
    for i in range(n_iterations):
        new_valuetable = np.copy(value_table)   # values from the previous sweep
        for state in range(states):
            q_value = []
            for action in range(actions):
                next_state_reward = []
                # env.env.P[state][action] lists tuples (prob, next_state, reward, done)
                for next_state_parameters in env.env.P[state][action]:
                    transition_prob, next_state, reward_prob, _ = next_state_parameters
                    reward = transition_prob * (reward_prob + gamma * new_valuetable[next_state])
                    next_state_reward.append(reward)
                # expected return of taking this action in this state
                q_value.append(np.sum(next_state_reward))
            # the value of a state is the best achievable Q-value
            value_table[state] = max(q_value)
        # stop once a full sweep changes the table by less than the threshold
        if np.sum(np.fabs(new_valuetable - value_table)) <= threshold:
            break
    return value_table


def extract_policy(value_table, gamma=1.0):
    # for every state, pick the action with the highest expected return
    policy = np.zeros(env.observation_space.n)
    for state in range(env.observation_space.n):
        Q_table = np.zeros(env.action_space.n)
        for action in range(env.action_space.n):
            for next_sr in env.env.P[state][action]:
                transition_prob, next_state, reward_prob, _ = next_sr
                Q_table[action] += transition_prob * (reward_prob + gamma * value_table[next_state])
        policy[state] = np.argmax(Q_table)
    return policy


value_table = value_iterations(env, 10000)
policy = extract_policy(value_table)
print(policy)
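To check how good the extracted policy is, we can roll it out in the environment. The snippet below is a minimal sketch rather than part of the program above; it assumes the classic gym API used here, where env.reset() returns a state and env.step() returns (next_state, reward, done, info).

def run_episode(env, policy):
    # follow the greedy policy until the episode ends
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        state, reward, done, _ = env.step(int(policy[state]))
        total_reward += reward
    return total_reward

# FrozenLake gives a reward of 1 only when the goal is reached,
# so the mean return over many episodes is the success rate
scores = [run_episode(env, policy) for _ in range(1000)]
print("success rate:", np.mean(scores))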
