INDEX
Explanations
phrases related to misinformation or deception
instances of the word "misleading" and its variants
Negative Logits
mun                       -0.71
âĶĢâĶĢâĶĢâĶĢâĶĢâĶĢâĶĢâĶĢ  -0.68
riot                      -0.67
itar                      -0.66
aldo                      -0.65
ucha                      -0.63
area                      -0.62
mega                      -0.62
ube                       -0.59
cakes                     -0.59
Positive Logits
ingly       1.01
misleading  0.83
misled      0.80
sters       0.79
deceive     0.78
mislead     0.78
fully       0.75
ulence      0.71
Disclosure  0.69
ulent       0.69
Activations Density: 0.026%