Explanations
instances of the word "reason" and its variations
Negative Logits
Token     Logit
cat       -0.15
gow       -0.14
keyed     -0.14
kir       -0.14
Margin    -0.14
uye       -0.14
omp       -0.13
aur       -0.13
/run      -0.13
gia       -0.13
Positive Logits
Token     Logit
why       0.22
why       0.19
dolayı    0.16
nant      0.16
nal       0.16
upert     0.16
lessly    0.16
EO        0.16
üstü      0.16
WHY       0.15
Activations Density 0.032%