Explanations
references to the concept of hypocrisy
Negative Logits
rig         -0.17
orns        -0.16
Kaplan      -0.15
اÙ쨹        -0.15
礼          -0.15
iffs        -0.15
Lair        -0.15
uge         -0.15
yo          -0.14
oons        -0.14
Positive Logits
.dy         0.15
머ëĭĪ        0.14
ody         0.14
ικο         0.14
.sy         0.14
Hass        0.14
slee        0.13
OnTrigger   0.13
ÄĽÅ¾        0.13
lop         0.13
Activations Density 0.024%