Explanations
abstract concepts or goals
concepts and terms related to objectives or claims
Negative Logits
ctors       -0.67
umbn        -0.67
idth        -0.63
utters      -0.61
idates      -0.61
headers     -0.60
ãĥ³ãĤ¸      -0.60
condem      -0.59
srf         -0.59
usha        -0.57
Positive Logits
ourselves       0.95
myself          0.90
firsthand       0.82
.<              0.76
yourself        0.75
unconsciously   0.73
vividly         0.73
empir           0.72
ality           0.71
manually        0.71
Activations Density: 0.240%