INDEX
Explanations
concepts related to deception and betrayal
Negative Logits
Token    Logit
IZED     -0.15
ALLY     -0.14
regon    -0.14
стер     -0.14
ARGIN    -0.13
osate    -0.13
emu      -0.13
raig     -0.13
uzzer    -0.13
olest    -0.13
Positive Logits
Token    Logit
ing      1.75
ING      1.02
ingt     0.68
ingen    0.54
инг      0.54
링       0.48
ング     0.47
ting     0.46
ings     0.46
ingo     0.46
Activations Density 0.513%