Explanations
terms and concepts related to responsibility and accountability
Negative Logits
Token     Logit
isu       -0.17
irony     -0.16
unknown   -0.16
zm        -0.15
chu       -0.15
ستی       -0.15
ironic    -0.14
bag       -0.14
elp       -0.14
adic      -0.13
Positive Logits
Token     Logit
why       0.23
reasons   0.23
why       0.21
Reasons   0.21
Reason    0.21
Why       0.20
WHY       0.20
reason    0.20
.reason   0.20
reason    0.19
Activations Density: 0.005%