INDEX
Explanations
words related to breaking or failure
references to the term "bust" and its variations in different contexts
Negative Logits
vomit       -0.67
WAYS        -0.65
Rouge       -0.64
Cruel       -0.62
ised        -0.62
Pradesh     -0.59
mble        -0.59
dfx         -0.58
ndra        -0.58
vomiting    -0.57
Positive Logits
le          0.95
lar         0.92
buster      0.90
aign        0.89
enegger     0.89
neck        0.89
y           0.88
cies        0.88
ards        0.87
les         0.87
Activations Density: 0.039%