Explanations
words associated with deceptive or misleading behavior
Negative Logits
Lal       -0.18
Loren     -0.16
Lilly     -0.16
æ½        -0.15
onden     -0.15
ÄŁine     -0.14
Liquid    -0.14
Lair      -0.14
нев       -0.14
Lowe      -0.14
Positive Logits
les       0.56
led       0.52
ling      0.45
LES       0.44
ler       0.43
lers      0.38
lesh      0.38
ledo      0.33
le        0.31
LED       0.30
Activations Density 0.069%