INDEX
Explanations
references to lies and deception
instances of the word "lie" and its variations in different contexts
Negative Logits
Token               Logit
arta                -0.78
orr                 -0.62
illion              -0.61
lished              -0.61
weaving             -0.60
Attempts            -0.60
200000              -0.59
=-=-=-=-=-=-=-=-    -0.58
liking              -0.57
andal               -0.57
Positive Logits
Token               Logit
lies                1.22
utenant             0.92
lie                 0.80
poons               0.75
creen               0.74
showc               0.71
layer               0.69
chool               0.69
HF                  0.67
ogyn                0.67
Activations Density: 0.003%
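
To give a sense of scale for the density figure, here is a minimal sketch. It assumes that activation density means the fraction of sampled token positions on which the feature fires (activation above zero); the source does not define the metric, so that reading is an assumption. The function name `activation_density` and the toy data are hypothetical and exist only for illustration.

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" is the share of positions where the feature is active.
    """
    acts = np.asarray(activations, dtype=float)
    if acts.size == 0:
        return 0.0
    return float(np.count_nonzero(acts > threshold) / acts.size)

# Hypothetical toy data: 100,000 token positions, of which 3 are active.
toy = np.zeros(100_000)
toy[[10, 5_000, 99_999]] = [1.2, 0.4, 2.5]

density = activation_density(toy)
print(f"{density:.3%}")  # prints 0.003%
```

Under that reading, a density of 0.003% corresponds to roughly 3 active positions per 100,000 tokens, which would be consistent with a feature that fires only on a narrow set of words.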