Explanations
References to self-perception and self-importance
Ego and arrogance
Negative Logits
edades      -0.46
pengantin   -0.39
Jerusalén   -0.38
œurs        -0.38
Détails     -0.38
vědět       -0.37
vejec       -0.36
jména       -0.36
Inglaterra  -0.35
jangkau     -0.35
Positive Logits
OGND            0.58
betweenstory    0.55
ego             0.52
claim           0.52
Ego             0.51
RTCK            0.50
SourceChecksum  0.50
arrogance       0.49
claim           0.49
arrogant        0.48
Activation density: 0.041%