Explanations
references to human beings and their characteristics
Negative Logits
Token       Logit
RAG         -0.72
eryl        -0.71
urations    -0.69
liga        -0.68
forth       -0.68
kick        -0.67
arella      -0.66
è¦          -0.66
chell       -0.65
REP         -0.65
Positive Logits
Token       Logit
beings      1.45
oids        1.16
itarian     1.11
itar        1.08
readable    1.00
embryonic   0.97
istic       0.93
zee         0.93
oid         0.91
fra         0.89
Activations Density: 0.025%