INDEX
Explanations
terms related to deception or misinformation
references to misleading information or statements
Negative Logits
mun         -0.80
area        -0.71
────────    -0.71
aldo        -0.69
itar        -0.69
mega        -0.68
FM          -0.66
Merit       -0.66
dain        -0.66
ucha        -0.66
Positive Logits
ingly           1.02
misleading      0.92
misled          0.87
mislead         0.85
deceive         0.81
misrepresent    0.78
statements      0.75
tactics         0.74
excuse          0.73
disclosures     0.72
Activations Density 0.015%
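
The density figure can be read as the share of token positions on which the feature fires. The sketch below assumes that reading (activation strictly greater than zero counts as "active"); the function name, the threshold parameter, and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means (number of active positions) / (total positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 3 active positions out of 20,000 tokens.
sample = [0.0] * 19_997 + [0.4, 1.2, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.015%
```

Under this assumption, a density of 0.015% corresponds to roughly 15 active positions per 100,000 tokens, which is consistent with a sparse, narrowly scoped feature.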