1 Introduction
Large language models (LLMs) have transformed natural language processing, reaching human-level performance on a wide range of tasks. However, their black-box nature poses serious interpretability challenges, particularly in high-stakes applications such as healthcare and legal analysis, where understanding how an AI system reaches its decisions is essential.
TokenSHAP addresses this challenge by adapting Shapley values from cooperative game theory to attribute importance to individual tokens or substrings of an input prompt. This gives a principled framework for understanding how different parts of the input contribute to the model's response.
2 Related Work
2.1 Interpretability in Machine Learning
Interpretability methods are broadly divided into black-box and white-box approaches. Black-box methods such as LIME and SHAP produce explanations without access to model internals, whereas white-box methods such as gradient-based saliency maps and layer-wise relevance propagation require full knowledge of the model architecture.
2.2 Interpretability in Natural Language Processing
In NLP, attention visualization techniques are widely used, but they often fail to provide quantitative importance measures. Recent work has explored feature attribution methods designed specifically for language models, although these still struggle with variable-length inputs and contextual dependencies.
3 The TokenSHAP Method
3.1 Theoretical Framework
TokenSHAP extends Shapley values to variable-length text inputs by treating tokens as players in a cooperative game. The payoff function is defined as the similarity between the model's output with and without specific subsets of tokens.
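One way to make this payoff concrete is sketched below. It is an illustration under stated assumptions, not the authors' implementation: `toy_model` is a hypothetical stand-in for an LLM call, and bag-of-words cosine similarity is assumed as the similarity measure because the text does not name one. The value of a coalition is taken to be the similarity between the response to the kept tokens and the response to the full prompt.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: returns the prompt in upper case."""
    return prompt.upper()


def make_value_function(tokens, model, similarity):
    """Build v(S): similarity between the response to the tokens whose
    positions are in S and the response to the full prompt."""
    full_response = model(" ".join(tokens))

    def v(subset):
        kept = [t for i, t in enumerate(tokens) if i in subset]
        return similarity(model(" ".join(kept)), full_response)

    return v


tokens = ["why", "is", "the", "sky", "blue"]
v = make_value_function(tokens, toy_model, cosine_similarity)
full_coalition_value = v({0, 1, 2, 3, 4})   # all tokens kept
empty_coalition_value = v(set())            # no tokens kept
partial_value = v({3, 4})                   # only "sky blue" kept
```

With a real model, `toy_model` would be replaced by an API or inference call, and an embedding-based similarity would likely be a better fit than word overlap.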
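3.2 Monte Carlo Sampling Approach
To keep the computation tractable, TokenSHAP uses Monte Carlo sampling: it randomly permutes the tokens and computes each token's marginal contribution along the permutation. The cost then scales with the number of sampled permutations rather than with the $2^{|N|}$ possible subsets, while the estimates still converge to the exact Shapley values as the number of samples grows.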
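The sampling idea can be checked on a small game whose exact Shapley values are known. The sketch below is a generic permutation-sampling estimator, not code from the TokenSHAP authors. The toy value function and its weights are made up for illustration: each player adds a fixed weight, and players 0 and 1 earn a shared bonus of 2.0 when both are present, so the exact values are 2.0, 3.0 and 0.5.

```python
import random


def monte_carlo_shapley(n, value, num_samples, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled player orderings."""
    rng = random.Random(seed)
    estimates = [0.0] * n
    for _ in range(num_samples):
        order = list(range(n))
        rng.shuffle(order)
        coalition = set()
        previous = value(coalition)
        for player in order:
            coalition.add(player)
            current = value(coalition)
            estimates[player] += current - previous  # v(S + player) - v(S)
            previous = current
    return [e / num_samples for e in estimates]


WEIGHTS = [1.0, 2.0, 0.5]  # hypothetical per-player contributions


def toy_value(coalition):
    """Additive game plus a bonus when players 0 and 1 cooperate."""
    total = sum(WEIGHTS[i] for i in coalition)
    if 0 in coalition and 1 in coalition:
        total += 2.0
    return total


exact = [2.0, 3.0, 0.5]  # the bonus is split equally between players 0 and 1
estimates = monte_carlo_shapley(3, toy_value, num_samples=2000)
```

Each sampled ordering costs n + 1 evaluations of the value function, and the estimates always sum to v(N) - v(∅) because the marginal contributions telescope along every ordering.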
4 Technical Implementation
4.1 Mathematical Formulation
The Shapley value for token $i$ is defined as:
$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!(|N|-|S|-1)!}{|N|!} [v(S \cup \{i\}) - v(S)]$
where $N$ is the set of all tokens, $S$ is a subset that excludes token $i$, and $v(S)$ is the value function that measures the quality of the model's output for the subset $S$.
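For very short prompts the formula can be evaluated directly. The sketch below enumerates every subset for a two-token game. The values in `TABLE` are invented for illustration and do not come from any experiment. Both subset sizes get weight 1/2 here, so $\phi_0 = \frac{1}{2}(0.6 - 0) + \frac{1}{2}(1.0 - 0.2) = 0.7$ and $\phi_1 = \frac{1}{2}(0.2 - 0) + \frac{1}{2}(1.0 - 0.6) = 0.3$.

```python
from itertools import combinations
from math import factorial


def exact_shapley(n, value):
    """Evaluate the Shapley formula by enumerating all subsets S of N without i."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight |S|! (|N| - |S| - 1)! / |N|! for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical value function for a two-token prompt.
TABLE = {
    frozenset(): 0.0,
    frozenset({0}): 0.6,
    frozenset({1}): 0.2,
    frozenset({0, 1}): 1.0,
}

phi = exact_shapley(2, lambda s: TABLE[frozenset(s)])
```

The enumeration visits $2^{|N|-1}$ subsets per token, which is why it only works as a sanity check on tiny inputs.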
4.2 Algorithm and Pseudocode
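import numpy as np

def tokenshap_importance(text, model, num_samples=1000):
    # tokenize, include_tokens and similarity are helpers assumed to be defined elsewhere.
    tokens = tokenize(text)
    n = len(tokens)
    shapley_values = np.zeros(n)
    baseline = model.predict(text)  # response to the full prompt

    def value(S):
        # v(S): similarity between the response to subset S and the full-prompt response
        return similarity(model.predict(include_tokens(tokens, S)), baseline)

    for _ in range(num_samples):
        permutation = np.random.permutation(n)
        S = set()
        previous = value(S)
        for i in permutation:
            S.add(int(i))
            current = value(S)
            # Marginal contribution v(S ∪ {i}) - v(S), as in the Shapley formula
            shapley_values[i] += current - previous
            previous = current
    return shapley_values / num_samples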
5 Experimental Results
5.1 Evaluation Metrics
TokenSHAP was evaluated on three key metrics: alignment with human judgments (measured by correlation with human-annotated importance scores), faithfulness (the ability to reflect the model's actual behavior), and consistency (stability across similar inputs).
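Two of these metrics can be sketched in a few lines. The text does not say how they were computed, so the choices here are assumptions: Pearson correlation stands in for human alignment, and a deletion test (how much the value drops when the top-ranked tokens are removed) stands in for faithfulness. All scores, ratings and weights below are invented for illustration.

```python
import numpy as np


def human_alignment(scores, human_ratings):
    """Pearson correlation between attribution scores and human ratings."""
    return float(np.corrcoef(scores, human_ratings)[0, 1])


def deletion_faithfulness(scores, value, k):
    """Drop in v when the k highest-scored tokens are removed from the full set."""
    n = len(scores)
    top_k = set(np.argsort(scores)[::-1][:k].tolist())
    full = set(range(n))
    return value(full) - value(full - top_k)


# Hypothetical numbers, for illustration only.
scores = np.array([0.05, 0.10, 0.60, 0.25])   # attribution scores per token
human = np.array([0.0, 0.1, 0.7, 0.2])        # human importance ratings
weights = [0.1, 0.1, 0.5, 0.3]                # toy "true" token contributions


def toy_value(coalition):
    """Toy value function: sum of the true contributions of the kept tokens."""
    return sum(weights[i] for i in coalition)


alignment = human_alignment(scores, human)
drop_top = deletion_faithfulness(scores, toy_value, k=1)      # remove best-ranked token
drop_bottom = deletion_faithfulness(-scores, toy_value, k=1)  # remove worst-ranked token
```

A faithful attribution should give a larger drop when its top-ranked token is removed than when its bottom-ranked token is removed, which is what the toy numbers show.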
5.2 Comparative Analysis
Experiments across diverse prompts and LLM architectures (including GPT-3, BERT, and T5) showed TokenSHAP outperforming baselines such as LIME and attention-based methods. The method achieved a 25% improvement in human alignment and 30% better faithfulness scores compared with existing approaches.
- Human alignment: 25% improvement
- Faithfulness: 30% better scores
- Consistency: high stability
6 Original Analysis
TokenSHAP represents a significant advance in LLM interpretability by bridging game theory and natural language processing. Its theoretical grounding in Shapley values gives feature attribution a rigorous mathematical basis and addresses the limitations of heuristic methods such as attention visualization. Much as CycleGAN introduced cycle consistency for unpaired image translation, TokenSHAP establishes consistency in token importance attribution across different input variations.
The Monte Carlo sampling approach delivers substantial computational efficiency, reducing the exponential cost of exact Shapley value computation to levels practical for real-world applications. This efficiency gain parallels advances in approximate inference methods seen in Bayesian deep learning, as documented in the Journal of Machine Learning Research. The method's ability to handle variable-length inputs sets it apart from traditional feature attribution techniques designed for fixed-size inputs.
Evaluating TokenSHAP across multiple model architectures yields useful insights into LLM behavior. The consistent improvement in alignment with human judgments suggests that the method captures intuitive notions of importance better than attention-based approaches. This agrees with findings from the Stanford HAI group, which has stressed the need for interpretability methods that match human cognitive processes. The faithfulness metrics indicate that TokenSHAP reflects the model's actual computation rather than offering post-hoc rationalizations.
TokenSHAP's visualization capabilities enable practical applications in model debugging and prompt engineering. By providing quantitative importance scores, the method goes beyond the qualitative assessments typical of attention visualization. This quantitative approach supports more systematic analysis of model behavior, similar to how saliency maps evolved in computer vision interpretability. The method's consistency across similar inputs indicates robustness and addresses concerns about the stability of interpretability methods raised in recent literature from MIT's Computer Science and Artificial Intelligence Laboratory.
7 Applications and Future Directions
TokenSHAP has immediate applications in model debugging, prompt optimization, and educational tools for AI literacy. Future directions include extending the method to multimodal models, real-time interpretation for conversational AI, and integration with model editing techniques. The method could also be adapted to detect model bias and to support AI fairness.
8 References
- Lundberg, S. M., & Lee, S. I. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems.
- Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. ACM SIGKDD.
- Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.
- Zeiler, M. D., & Fergus, R. (2014). Visualizing and Understanding Convolutional Networks. European Conference on Computer Vision.
- Bach, S., et al. (2015). On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS ONE.