INDEX
Explanations
sentences that express harm or potential danger
Head Attr Weights
0:0.07
1:0.03
2:0.07
3:0.03
4:0.06
5:0.04
6:0.16
7:0.05
8:0.09
9:0.28
10:0.03
11:0.04
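The weights above are easier to compare once ranked. The following is a minimal sketch, not part of the original dashboard: the dictionary simply copies the twelve listed values, and the helper name top_heads is hypothetical. The listed weights sum to 0.95 rather than 1.0, presumably because each value is rounded to two decimals (an assumption).

```python
# Head attribution weights copied from the listing above (head index -> weight).
head_weights = {
    0: 0.07, 1: 0.03, 2: 0.07, 3: 0.03, 4: 0.06, 5: 0.04,
    6: 0.16, 7: 0.05, 8: 0.09, 9: 0.28, 10: 0.03, 11: 0.04,
}


def top_heads(weights, k=3):
    """Return the k (head, weight) pairs with the largest weights.

    Hypothetical helper for illustration; ties keep their listing order
    because Python's sort is stable.
    """
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Round the total to two decimals to avoid floating-point noise.
total = round(sum(head_weights.values()), 2)
top = top_heads(head_weights)

print("top heads:", top)
print("sum of listed weights:", total)
```

In this listing, heads 9 and 6 carry the two largest weights (0.28 and 0.16).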
Negative Logits
Antar       -3.82
Laos        -3.73
Ser         -3.61
Magikarp    -3.58
Planes      -3.55
Mickey      -3.54
Nar         -3.48
nar         -3.43
Marine      -3.43
Stones      -3.42
Positive Logits
Rebecca     8.65
becca       6.04
Reb         4.60
beck        3.68
abe         3.66
Reb         3.66
bern        3.53
resp        3.52
eva         3.45
Rev         3.37
Activations Density 0.001%