INDEX
Explanations
phrases or statements that indicate the fabrication or creation of stories
Negative Logits
Token               Logit
luaj                -0.74
zik                 -0.68
hens                -0.65
ridor               -0.63
Gi                  -0.63
DEM                 -0.63
Xi                  -0.62
externalActionCode  -0.61
avery               -0.61
gnu                 -0.61
Positive Logits
Token               Logit
ulates              0.82
excuses             0.81
ulate               0.73
ulated              0.70
itional             0.69
excuse              0.68
ulations            0.67
iframe              0.65
ulating             0.64
stories             0.62
Activations Density: 0.021%