Explanations
self | identity | consciousness
Negative Logits
P       0.70
N       0.64
W       0.60
H       0.59
L       0.58
ذلك     0.58
T       0.55
G       0.53
ع       0.53
í       0.52
Positive Logits
Identity    0.80
identity    0.75
IDENTITY    0.70
۰           0.66
0           0.62
identity    0.59
Identity    0.59
Culture     0.57
०           0.54
identité    0.53
Activations Density 0.632%