INDEX
Explanations
references to care and responsibility for oneself and others
Negative Logits
otal        -0.16
åIJĪæł¼      -0.15
abor        -0.15
uras        -0.15
Bart        -0.14
Raid        -0.14
andon       -0.14
l           -0.14
obb         -0.13
partials    -0.13
Positive Logits
ooter           0.17
yntax           0.17
lsen            0.16
sick            0.16
azor            0.15
yandan          0.14
GED             0.14
èħ              0.14
Sick            0.14
infrastructure  0.14
Activations Density: 0.072%