Explanations
leading to choice or consequence
Negative Logits
noodles     0.39
हालांकि     0.39
bristles    0.39
oats        0.38
doubts      0.38
bearings    0.38
zigzag      0.38
knuckles    0.38
analogs     0.38
novices     0.37
Positive Logits
telah       0.67
选择了      0.63
把它        0.55
इन्होंने    0.54
выбрали     0.54
сделали     0.52
offenbar    0.51
hayan       0.48
उन्ह        0.48
Somehow     0.48
Activations Density 0.040%