INDEX
Explanations
mentions of personal experiences and possessions
Negative Logits
utan      -0.64
hov       -0.63
rities    -0.60
esson     -0.58
Lenin     -0.58
Created   -0.57
hire      -0.57
rior      -0.56
verages   -0.56
namese    -0.56
Positive Logits
doors     1.30
Pandora   1.10
door      1.04
Doors     1.04
valves    1.01
gates     1.01
portals   0.89
backdoor  0.84
doors     0.80
pores     0.79
Activations Density: 0.048%