Explanations
words related to manners and behavior towards others
Negative Logits
GOODMAN          -0.86
disproportion    -0.73
arnaev           -0.73
senal            -0.69
VL               -0.68
ilion            -0.68
Layer            -0.66
Offline          -0.66
idem             -0.64
LR               -0.64
Positive Logits
embraced         0.86
entertained      0.84
parted           0.84
greeted          0.83
awaiting         0.83
inquired         0.82
awaited          0.82
welcomed         0.81
complied         0.81
accepted         0.80
Activations Density 0.073%