INDEX
Explanations
instances of pretending or pretending-related actions
instances of the word "pretend" and its variations
Negative Logits
cedented    -0.66
aird        -0.63
otype       -0.61
aic         -0.61
ergy        -0.60
ilings      -0.60
uterte      -0.59
cutting     -0.58
atl         -0.58
Citation    -0.58
Positive Logits
innocence   1.01
ignorance   0.85
allegiance  0.78
otherwise   0.76
pas         0.72
antly       0.66
insanity    0.65
forgot      0.64
equival     0.64
ingly       0.62
Activations Density 0.055%
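
For readers unfamiliar with the density figure, here is a minimal sketch of one common reading of it: the fraction of token positions on which the feature's activation is nonzero. That definition is an assumption, not something this page states, and the function name `activation_density`, the `threshold` parameter, and the toy activation list are all hypothetical illustrations rather than data from this feature.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of tokens on which the feature fires.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data (hypothetical): 20,000 token positions, 11 of which activate.
toy_activations = [0.0] * 19989 + [1.5] * 11
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

Under this reading, a density of 0.055% would mean the feature fires on roughly 1 in every 1,800 tokens, which fits a feature tied to a single word family such as "pretend".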