INDEX
Explanations
phrases indicating potential negative consequences or implications of actions and decisions
Negative Logits
Token         Logit
ointment      -0.14
jinak         -0.14
oldt          -0.13
essel         -0.13
ailability    -0.13
anske         -0.13
blob          -0.12
offending     -0.12
Attr          -0.12
záv           -0.12
Positive Logits
Token           Logit
consequences    0.71
effects         0.60
consequence     0.60
implications    0.57
ramifications   0.54
repercussions   0.53
impacts         0.52
Effects         0.49
impact          0.48
effects         0.47
Activations Density: 0.395%
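
The density figure can be read as the fraction of token positions on which the feature fires. The page does not say how it was computed, so the sketch below is only an illustration under that assumed definition. The function name activation_density, the threshold of 0.0, and the sample data are all hypothetical and do not come from the page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumed definition: a feature "fires" on a token when its activation is
    strictly greater than `threshold` (0.0 here, chosen for illustration).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: 1000 token positions, 4 of which activate the feature.
sample = [0.0] * 996 + [0.8, 1.2, 0.3, 2.1]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.400%
```

Under this reading, a density of 0.395% would mean the feature is active on roughly 4 of every 1000 tokens sampled.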