Explanations
pronouns and self-references
Negative Logits
Token         Value
ذلك           0.67
reducing      0.66
it            0.65
while         0.65
While         0.63
disrupting    0.63
적으로        0.62
sha           0.62
am            0.61
USA           0.59

Positive Logits
Token         Value
zelf          0.86
Примечания    0.81
Personally    0.73
缀            0.72
<unused369>   0.72
جميعا         0.71
self          0.69
所有人        0.69
倆            0.69
<unused59>    0.68
Activations Density: 0.245%