INDEX
Explanations
Attends to tokens indicating a significant outcome or concept, from corresponding tokens that represent a contrasting or opposing idea.
Head Attr Weights
0:0.03
1:0.09
2:0.05
3:0.03
4:0.19
5:0.47
6:0.05
7:0.05
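The weights above can be ranked to see which head dominates. The following is a minimal sketch that assumes each entry is a head index paired with its share of the attribution; the variable names are illustrative and not part of the original page. The listed values sum to 0.96 rather than 1.00, presumably because of rounding, so the shares are normalized by the listed total.

```python
# Head attribution weights copied from the listing above (head index -> weight).
head_attr_weights = {
    0: 0.03, 1: 0.09, 2: 0.05, 3: 0.03,
    4: 0.19, 5: 0.47, 6: 0.05, 7: 0.05,
}

# Rank heads from largest to smallest weight.
ranked = sorted(head_attr_weights.items(), key=lambda kv: kv[1], reverse=True)
top_head, top_weight = ranked[0]

# Normalize by the listed total (assumption: the shortfall from 1.0 is rounding).
total = sum(head_attr_weights.values())
top_share = top_weight / total
top_two_share = (ranked[0][1] + ranked[1][1]) / total

print(f"top head: {top_head}, share of listed total: {top_share:.2f}")
print(f"top two heads: {ranked[0][0]} and {ranked[1][0]}, combined share: {top_two_share:.2f}")
```

Under these assumptions, head 5 carries the largest weight, followed by head 4.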
Negative Logits
kaarangay  -0.44
ostavi  -0.40
Personendaten  -0.39
Roskov  -0.38
adpleegd  -0.38
Tikang  -0.37
telefónica  -0.36
détru  -0.35
betweenstory  -0.35
createSlice  -0.35
Positive Logits
opus  0.24
Wikimédia  0.23
Premios  0.22
[  0.21
portál  0.21
neme  0.21
[]  0.20
heur  0.20
aggio  0.20
mée  0.20
Activations Density 0.331%