Explanations
concepts related to societal criticism and personal accountability
Negative Logits
Token        Logit
compan       -0.16
極           -0.15
icari        -0.15
_utilities   -0.15
ego          -0.15
oret         -0.14
             -0.13
かな         -0.13
ibase        -0.13
سر           -0.13
Positive Logits
Token        Logit
somehow      0.50
supposedly   0.32
allegedly    0.29
Somehow      0.28
magically    0.27
як           0.26
supposed     0.24
therefore    0.23
myster       0.23
blah         0.23
Activations Density 0.857%