INDEX
Explanations
references to subjective experiences or states
Negative Logits
Token     Logit
thing     -0.16
597       -0.14
idious    -0.14
689       -0.14
ruba      -0.14
umba      -0.13
899       -0.13
Zaman     -0.13
ittest    -0.13
edList    -0.13
Positive Logits
Token        Logit
ologically   0.29
ough         0.23
oret         0.20
tas          0.20
way          0.19
eway         0.18
ore          0.18
instant      0.18
orie         0.17
same         0.16
Activations Density: 0.087%