On the antiderivatives of x^p/(1 − x) with an application to optimize loss functions for classification with neural networks

Detailed bibliography
Published in: Annals of Mathematics and Artificial Intelligence, Volume 90, Issue 4, pp. 425-452
Main author: Knoblauch, Andreas
Format: Journal Article
Language: English
Published: Cham: Springer International Publishing, 01.04.2022
Springer Nature B.V.
ISSN: 1012-2443, 1573-7470
Description
Summary: Supervised learning in neural nets means optimizing synaptic weights W such that outputs y(x; W) for inputs x match as closely as possible the corresponding targets t from the training data set. This optimization means minimizing a loss function L(W) that is usually motivated by maximum-likelihood principles, silently making prior assumptions on the distribution of output errors y − t. While classical crossentropy loss assumes triangular error distributions, it has recently been shown that generalized power error loss functions can be adapted to more realistic error distributions by fitting the exponent q of a power function used for initializing the backpropagation learning algorithm. This approach can significantly improve performance, but computing the loss function requires the antiderivative of the function f(y) := y^{q−1}/(1 − y), which has previously been determined only for natural q ∈ ℕ. In this work I extend this approach to rational q = n/2^m where the denominator is a power of 2. I give closed-form expressions for the antiderivative ∫ f(y) dy and the corresponding loss function. The benefits of such an approach are demonstrated by experiments showing that optimal exponents q are often non-natural, and that the error exponents q best fitting output error distributions vary continuously during learning, typically decreasing from large q > 1 to small q < 1 as learning converges. These results suggest new adaptive learning methods where loss functions could be continuously adapted to output error distributions during learning.
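For orientation on the previously known case mentioned in the summary, the antiderivative of f(y) = y^{q−1}/(1 − y) for natural q can be recovered by a standard geometric-series expansion. The following is a minimal sketch of that known case only, not the paper's new closed form for rational q = n/2^m:

\[
  \frac{y^{q-1}}{1-y} = \sum_{k \ge q-1} y^{k} \quad (|y| < 1),
  \qquad
  \int_0^y \frac{t^{q-1}}{1-t}\,dt
  = \sum_{j \ge q} \frac{y^{j}}{j}
  = -\ln(1-y) - \sum_{j=1}^{q-1} \frac{y^{j}}{j},
  \quad q \in \mathbb{N}.
\]

For q = 1 the finite sum is empty and the antiderivative reduces to −ln(1 − y), the logarithmic term familiar from crossentropy loss; larger natural exponents q subtract additional low-order polynomial terms from it.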
DOI: 10.1007/s10472-022-09786-2