Explanations
phrases emphasizing collective responsibility and shared experiences
Negative Logits
never     -0.21
not       -0.19
không     -0.19
neither   -0.18
tidak     -0.18
cannot    -0.18
nicht     -0.18
不会      -0.17
doesn     -0.17
所有      -0.17
Positive Logits
uded         0.29
ude          0.25
uding        0.23
alike        0.22
ready        0.21
ayed         0.21
udes         0.20
-important   0.19
ways         0.19
LLLL         0.19
Activations Density 0.079%