INDEX
Explanations
questions or scenarios related to decision-making and responsibility
Negative Logits
cule        -0.98
ahime       -0.92
iard        -0.88
lication    -0.88
ulhu        -0.87
iza         -0.87
zeb         -0.86
Lago        -0.85
fort        -0.85
pha         -0.84
Positive Logits
happen          1.37
happens         1.16
happened        1.14
transpired      1.13
?]              1.12
happ            1.06
characterize    1.01
differe         0.94
difference      0.93
spoil           0.91
Activations Density: 0.362%