wasshuber

Programming Machine Learning: a tip and a gotcha

Tip: If you are on a slow or old machine like me, or if you want to run many different examples to explore the design space, you can speed up the calculations by removing a border from the MNIST image data. Every image has a 1-pixel white border. Removing it shrinks each image from 28x28 to 26x26, which reduces the number of input variables by 108 (from 784 to 676), or more than 13%. In fact, you can drop even a 3-pixel border without any impact that I can notice. Dropping more is also possible, but then the expected maximum accuracy starts to drop. Still, it is quite remarkable that even with only the innermost 8x8 image fragment one can easily get above 80% accuracy.
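A minimal sketch of the cropping step, assuming the images are stored as a NumPy array with one flattened 28x28 image per row. The function name crop_border and its parameters are made up for illustration and are not from the book. If your code adds a bias column to the inputs, the crop would have to happen before that step.

```python
import numpy as np

def crop_border(images_flat, border, side=28):
    """Remove `border` pixels from every edge of each flattened square image.

    images_flat: array of shape (n_images, side * side)
    returns:     array of shape (n_images, (side - 2 * border) ** 2)
    """
    n = images_flat.shape[0]
    images = images_flat.reshape(n, side, side)
    cropped = images[:, border:side - border, border:side - border]
    return cropped.reshape(n, -1)

# Stand-in data with the MNIST layout (5 images, 784 pixels each).
X = np.zeros((5, 28 * 28))

for border in (1, 3, 10):
    X_small = crop_border(X, border)
    print(border, X_small.shape[1])  # number of remaining input variables
```

A border of 1 leaves 676 inputs, a border of 3 leaves 484, and a border of 10 leaves the innermost 8x8 fragment with 64 inputs. The same crop has to be applied to the training, validation, and test images.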

Gotcha: I ran the scenario with one hidden layer of 100 nodes against the original test set of 10,000 examples, without splitting it into 5,000 for validation and 5,000 for testing. I was surprised that the maximum accuracy I could achieve was only 97.8%, not the 98.6% stated in the book. However, this is purely an effect of the test set. When I split it into a validation set and a test set, with 5,000 examples for testing, I got the 98.6% accuracy with the same network weights. It surprised me that the size of the test set makes that big a difference in accuracy.
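As a rough sanity check on these numbers, here is a small sketch of the arithmetic. It assumes the two halves have exactly 5,000 examples each and that the weights are identical in both measurements. The standard-error formula additionally assumes independent test examples drawn from one distribution, which may not hold for this data. The function names are hypothetical.

```python
import math

def other_half_accuracy(full_acc, half_acc):
    """Accuracy implied for the other half when both halves are the same size."""
    return 2 * full_acc - half_acc

def binomial_std_error(acc, n):
    """Standard error of an accuracy estimate from n independent examples."""
    return math.sqrt(acc * (1 - acc) / n)

full_acc = 0.978   # all 10,000 test examples
half_acc = 0.986   # the 5,000 examples used for testing

implied = other_half_accuracy(full_acc, half_acc)
print(f"implied accuracy on the other 5,000 examples: {implied:.3f}")
print(f"standard error at n=5000:  {binomial_std_error(half_acc, 5000):.4f}")
print(f"standard error at n=10000: {binomial_std_error(full_acc, 10000):.4f}")
```

Under these assumptions the other 5,000 examples would have to score about 97.0%. That gap between the two halves is several times the sampling error one would expect from the set size alone.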

23 comments


2023-03-06 14:59:39 UTC

Most Liked

wasshuber

Another tip that seems to help speed up training: I do a batch-size ramp. I start with batches of about 2-3 times the number of classes (for MNIST there are 10 classes). For example, I start with a batch size of 20. I double the batch size with each epoch until I reach the final batch size of my choice, and then continue with this batch size until the end.

The advantage here is that at the beginning, when the weights are far from their optimum, a particularly good estimate of the gradient is not necessary, so small batch sizes are fine and faster. But as we approach the optimum, larger batch sizes help to get an accurate gradient.

This reduces the importance of setting a proper batch size. One can take a larger final batch size without negatively impacting the final accuracy of the model. A large batch size can sometimes mean that one gets stuck in a local minimum, and the final accuracy of the model suffers. Ramping the batch size combines the advantages of small and large batch sizes.
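The ramp described above could be sketched roughly like this. The function name batch_size_schedule and the example values (final batch size 256, 6 epochs) are assumptions for illustration, not the code I actually ran.

```python
def batch_size_schedule(start, final, epochs):
    """Return one batch size per epoch: double each epoch, capped at `final`."""
    sizes = []
    size = start
    for _ in range(epochs):
        sizes.append(min(size, final))
        size = min(size * 2, final)
    return sizes

# Start at 20 (twice the 10 MNIST classes) and ramp up to a final size of 256.
schedule = batch_size_schedule(start=20, final=256, epochs=6)
print(schedule)  # [20, 40, 80, 160, 256, 256]
```

In a training loop, each epoch would then slice the training set into mini-batches of the size given by the schedule for that epoch.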

Post #12

wasshuber

If you like numerical issues, let me describe a problem I chased for three days. While implementing dropout regularization I ran into an issue with the implementation of softmax. In your book the implementation of softmax is fine but basic, meaning it does not protect against overflow or underflow in the exponentials. What some do, for example, is subtract the maximum value before the exponential is applied. Mathematically this is equivalent because it simply multiplies the numerator and denominator of the softmax formula by the same constant factor. Nothing changes. Online I even found Python code for it that was something like

e = np.exp(x - np.max(x))

The problem with this code is subtle, but numerically it is stupid. Here is what happens: np.max(x) returns the maximum of the entire matrix, meaning the maximum over the entire mini-batch. But we only need the maximum for each input (image), not across several inputs. Numerically this causes problems because in some cases it pushes the arguments of the exponential for an input so far into negative values that they all underflow and all of that input's exponentials return zero. The solution is to subtract only the row maximum, not the maximum across the entire mini-batch. Something like

e = np.exp(x - np.max(x,axis=1).reshape(-1,1))

This numerical issue manifested itself in the following way. Initially, the network was training perfectly fine. It reached about the accuracy it should reach. Then the accuracy started to drop, first slowly but then very quickly, and over the course of a few epochs the entire network blew up, with all weights increasing until everything was saturated. Nothing could stop it. I tried clipping the gradients, limiting the weight norms, etc. The issue was the above-mentioned bad implementation of the softmax function.
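To make the difference concrete, here is a small self-contained sketch comparing the two variants. The function names and the logit values are made up, and the gap of 1000 between the rows is deliberately exaggerated so that the underflow shows up in a two-row batch.

```python
import numpy as np

def softmax_rowwise(x):
    """Softmax over axis 1, shifting each row by its own maximum."""
    e = np.exp(x - np.max(x, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def softmax_global_shift(x):
    """Problematic variant: shifts every row by the maximum of the whole batch."""
    e = np.exp(x - np.max(x))
    return e / np.sum(e, axis=1, keepdims=True)

# Two inputs whose logits live on very different scales.
logits = np.array([[1000.0, 1001.0, 1002.0],
                   [0.0, 1.0, 2.0]])

good = softmax_rowwise(logits)

# Silence the 0/0 warning so the NaN result can be inspected.
with np.errstate(invalid="ignore", divide="ignore"):
    bad = softmax_global_shift(logits)

print(good)  # both rows are valid probability vectors
print(bad)   # the second row underflowed to 0/0 and is NaN
```

With the row-wise shift, the largest argument to the exponential in every row is exactly zero, so each row has at least one term equal to 1 and the denominator cannot vanish.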

Post #7

wasshuber

I discovered this myself by experimenting with all kinds of activation functions. It was easy to change the code from sigmoid to other activation functions and I was curious about what changes if I used different functions. I tried some really weird ones, too.

This is why I chose your path of coding it myself, because then it is much easier to change the things I want to change. With a library, one is in a straitjacket and can only change what the library allows.

What made me analyze it more carefully was the fact that this shifted ReLU learned better in combination with dropout. So I tried to see why and noticed that the magnitude of the weights going from layer to layer stayed about the same, whereas with ReLU they keep growing. I don’t have any good explanation for why this is better, except that if there is a sort of additional bias the weights have to learn (their magnitude increases with deeper layers), then this will take longer in the learning process than if they do not have to learn this bias.

Then again, this is such a simple modification that I would be surprised if nobody has tried this before and noted the improvement. Searching online I do see shifted ReLUs being mentioned in lists of activation functions, but I have not found anything that mentions the improvement to learning they achieve and how this may be connected to the weight magnitude staying the same. We should also not forget that I only applied this to the MNIST data set. I don’t know if my observations hold in general.

Post #11
