INDEX
Explanations
references to the field of science or scientific concepts
words related to conscience or ethical considerations
Negative Logits
stage       -0.79
lift        -0.68
ting        -0.68
ton         -0.67
lain        -0.65
stood       -0.63
nikov       -0.62
managed     -0.61
TON         -0.61
tolerate    -0.60
Positive Logits
ences       1.00
zona        0.87
ppo         0.86
pe          0.85
ptions      0.85
ardo        0.84
ption       0.84
oglu        0.84
emi         0.83
otti        0.83
Activations Density: 0.023%