INDEX
Explanations
phrases indicating statements or claims
phrases that include claims or assertions about events or states of being
Negative Logits
intosh          -0.67
irez            -0.67
emort           -0.63
pite            -0.63
patch           -0.61
DOWN            -0.61
leground        -0.60
ortunate        -0.60
PLUS            -0.59
Reconstruction  -0.58
Positive Logits
behave          0.79
embody          0.76
esty            0.72
manipulate      0.71
perform         0.71
asted           0.70
satisfy         0.68
adhere          0.67
ads             0.67
speak           0.66
Activations Density: 0.105%