INDEX
Explanations
statements emphasizing the significance or necessity of various subjects or concepts
Negative Logits
ptrdiff     -0.17
abus        -0.16
rieg        -0.15
irting      -0.15
erty        -0.15
cul         -0.15
ild         -0.15
reh         -0.15
inqu        -0.15
issy        -0.14
Positive Logits
importance      0.28
/import         0.24
Importance      0.23
role            0.19
significance    0.17
Attached        0.15
Role            0.15
/utility        0.15
/effects        0.15
-role           0.15
Activations Density: 0.015%