Brain-Inspired Neuromorphic Computer SpiNNaker Overheated When Coolers Lost Their Chill

Exclusive The brain-inspired SpiNNaker machine at Manchester University in England suffered an overheating incident over the Easter weekend that will send a chill down the spines of datacenter administrators.

According to Professor Steve Furber, now retired (although he told El Reg “SpiNNaker is still seen as my baby!”), a failure with the cooling on April 20 led to a rise in temperatures until the servers were manually shut down the following day.

The SpiNNaker (Spiking Neural Network Architecture) project is all about simulating a brain by connecting hundreds of thousands of Arm cores. While a human brain presents a huge challenge, Furber, one of the designers of the original Arm processor, reckoned a mouse brain was possible.

During a talk earlier this month to celebrate the 40th anniversary of the switch-on of the first Arm processor, Furber told the audience that the hope was to model “one whole mouse” at the required level of detail.

Assuming the hardware survived its baking.

“SpiNNaker,” he told The Register, “is hosted in the Kilburn Building, which was completed in 1972 as a purpose-built computer building and, as such, has a plant room that supplies chilled water as a utility to all the central machine rooms.

“The SpiNNaker room was built to house the machine in 2016 in what used to be the mechanical workshop, and is cooled by circulating hot air from the back of the cabinets through a plenum chamber into chillers at either end that blow the air through a cooling system using the building’s chilled water.”

The problem was with the chilled water supply. Furber said, “If the chilled water isn’t actually chilled, the chiller fans are adding to the problem rather than helping solve it.”

And so the temperatures began to rise inexorably. With no automatic shutdown for the installation as a whole, the servers struggled on. Furber told us that he believed there was an automatic over-temperature shutdown on the individual SpiNNaker boards, and said, “This may have protected the SpiNNaker hardware from damage,” but even with the hard-to-replace boards off, the network switches and power supplies remained powered up.

The latter two component types suffered some damage, and without them, the SpiNNaker boards cannot all be tested, “so there may be more issues hidden behind the ones we know about.”

Furber added, “We have had a few issues with the cooling system in the nine years that the machine has been operational, but these have not previously led to any damage.” He reckoned that the long Easter weekend (in the UK, where Good Friday and Easter Monday are both public holidays) might have contributed to the length of time it took to contain the temperature rise.

“We are looking into ways to fully automate the shutdown process in the future!”
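For readers wondering what automating such a shutdown might involve, here is a minimal sketch in Python. It is purely illustrative and is not the Manchester team's design: the temperature limit, the number of readings, the component names, and the `ThermalWatchdog` class are all assumptions. The one detail taken from the incident is the power-off order, which covers the network switches and power supplies that stayed on after the boards had shut down.

```python
from dataclasses import dataclass

# Hypothetical values: none of these come from the SpiNNaker installation.
SHUTDOWN_TEMP_C = 40.0      # assumed room-air limit in degrees Celsius
CONSECUTIVE_READINGS = 3    # readings at or over the limit before acting
# Assumed power-off order: boards first, then the parts left running in the incident.
SHUTDOWN_ORDER = ("boards", "network_switches", "power_supplies")


@dataclass
class ThermalWatchdog:
    """Latching over-temperature trip that needs several hot readings in a row."""
    limit_c: float = SHUTDOWN_TEMP_C
    needed: int = CONSECUTIVE_READINGS
    over_count: int = 0
    tripped: bool = False

    def observe(self, air_temp_c: float) -> tuple[str, ...]:
        """Record one reading; return the components to power off, if any."""
        if self.tripped:
            return ()  # already shut down; stays latched until someone resets it
        if air_temp_c >= self.limit_c:
            self.over_count += 1
        else:
            self.over_count = 0  # a single cooler reading resets the count
        if self.over_count >= self.needed:
            self.tripped = True
            return SHUTDOWN_ORDER
        return ()
```

Requiring several consecutive hot readings is one way to avoid tripping on a single sensor glitch, and latching the trip means the machine stays off over a long weekend until a person has checked the cooling.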

As for the system’s current state, Furber told us, “The machine is back up for internal users at around 80 percent of full capacity but still undergoing tests.”

The good news is that the software is designed to work around partial hardware failures. The bad news is that replacing the failed parts will likely require further shutdowns. ®

