INDEX
Explanations
statements or phrases indicating falsehood or deception
statements related to falsehoods, lies, or inaccuracies
Negative Logits
daunting       -0.72
restless       -0.68
cellaneous     -0.67
teased         -0.65
melancholy     -0.64
goodbye        -0.63
finally        -0.63
someday        -0.62
aunted         -0.61
farewell       -0.61
Positive Logits
facts          0.89
ItemImage      0.87
factual        0.83
Claim          0.83
Fact           0.83
merely         0.80
Actually       0.80
neither        0.79
untrue         0.78
misrepresent   0.77
Activations Density 0.888%