Explanations
words related to potential outcomes or results of actions
mentions of consequences
Negative Logits
uni       -0.76
erd       -0.74
zig       -0.74
ymph      -0.73
estone    -0.73
yne       -0.72
yss       -0.71
bor       -0.69
BN        -0.69
yip       -0.67
Positive Logits
consequences     1.07
repercussions    0.88
havoc            0.86
fallout          0.86
thereof          0.82
romeda           0.81
outweigh         0.79
consequence      0.79
bringer          0.78
aval             0.78
Activations Density: 0.020%