In December 2020, the Projects and Infrastructure Modernization Division, Facilities Division, and Computing Sciences Area completed a multi-year project to increase power and capabilities for the National Energy Research Scientific Computing Center (NERSC). The project’s goal was to add the latest NERSC supercomputer, known as Perlmutter after Berkeley Lab’s own Nobel Prize-winning astrophysicist Saul Perlmutter. As a national user facility that supports the entire national lab network and over 7,000 users annually, NERSC’s scope and impact reach far beyond Berkeley Lab. The project team that brought the project to completion had to balance complex infrastructure and electrical challenges with NERSC’s intensive schedule. Several members of that project team gathered to discuss the scope of the project, and how they managed the complexities and worked successfully together.
Participants
Jeff Broughton: Deputy for Operations, NERSC
Ben Maxwell: Building Infrastructure Group Lead, NERSC
David Topete: Project Manager, Projects and Infrastructure Modernization (PIM) Division
Rey Espino: Construction Manager, PIM Division
Michael Lee: Acting Facilities Technical Manager, Facilities Division
Dan Williams: Divisional Electrical Safety Officer, Facilities Division
Carlos Escobedo: Facility Area Manager (FAM), Facilities Division
Strategic Communications: I’ve seen this project referred to as the N9 project and I’ve seen it referred to as the NERSC project. What was this project?
Jeff Broughton: The National Energy Research Scientific Computing Center (NERSC) under the Department of Energy (DOE) Advanced Scientific Computing Research (ASCR) program provides mission computational resources for DOE Office of Science sponsored projects. NERSC has been in existence for 47 years now, starting originally at Livermore [Lawrence Livermore National Lab] before moving to Berkeley in 1996.
The project itself has two parts. The first was the actual procurement and acquisition of the computer system itself. The second was the conventional facilities work, which we refer to as the NERSC 9 facilities upgrade: a project to add 12 and a half megawatts of power and cooling to Building 59 to support NERSC 9 and the systems.
Strategic Communications: That’s a big scope of work. What did it take to build 12 and a half megawatts of power and cooling?
Ben Maxwell: In a nutshell, it’s a construction project, like many others that the Lab has, just scaled up. You add the people and expertise necessary, and you have a team of designers, subject matter experts in their field, licensed engineers and architects. There are a lot of DOE reviews to go through for the acquisition of major systems. Once you land the project on budget or under budget, you’re authorized to go to construction. That involves engaging a specialty construction contractor to manage and oversee the work and, slowly but surely, you buy equipment and install it, and slowly hook it up into the existing operating building. It takes a long time.
David Topete: With such a complex project on an existing operating facility, we had to make sure that there was enough communication, enough documentation, of everything going around, and make sure that people were aware of what’s happening.
Strategic Communications: How do you coordinate all of that? What did that look like?
Rey Espino: That’s the biggest part about this project, especially as it was in an operating building that Facilities was maintaining. We were adding a lot of electrical and a lot of cooling load to the building. So a lot had to be planned ahead of time and a lot had to be scheduled for Facilities to support the contractor to tie the new equipment to the existing building. This project was very heavy on the electrical portion.
Jeff: One of the trickiest parts of this entire project was that we were essentially doing a heart/lung transplant on an existing building. It was not only the project managers, construction managers, the Facilities organization, the general contractor (XL Construction), and the individual electricians and mechanical techs who were involved. It took the designers, the construction company, and even many of the folks at NERSC who actually run the systems, because we had to coordinate when they would take them down and how we would come back up. We would have to notify users so that we could avoid the impact to their work. It was a really extraordinary effort. What it took was just an enormous amount of upfront planning. And then, just a seamless execution once we actually did it.
Strategic Communications: How did you come to an agreement on what that looked like and how you were going to keep things moving forward correctly?
Ben: I think a lot of it is building on past experience. It was effective communication and a lot of meetings and documenting things and methods of procedure, so that all the steps are taken care of: timelines, handoffs, making decisions that affect downstream things. It’s literally just a lot of communication, good documentation. That is the root of the success.
Strategic Communications: For the Operations participants, how would you describe your support of this project?
Dan Williams: We’re operating at a 12,000 volt substation and dealing with the complexity of this system as it intertwines with the NERSC program. We were able to install this system successfully without incurring an injury, incident or lost time throughout the duration of the project. And that’s a real testament to Mike Lee and his team of electricians, to Rey and his ability to work with the contractor, and to the quality of the contractor themselves. I myself, as one person, don’t do anything. I support them and help them gather the information to review and make sure we’re all communicating effectively before we start operating switches or opening up the gear or making our connections.
Michael Lee: I want to explain the type of work involved with a project. Typically you have a substation set up that provides power to a single building. In this case, there were multiple substations being brought up. In essence, it was as if several buildings were being brought online. It was a fantastic effort from Rey and his group. And what Dan spoke to, this all happened without a safety incident. That doesn’t happen as a result of poor planning or poor coordination.
Carlos Escobedo: Last year [2020], we had a lot of projects happening at the same time. As Facility Area Manager, I have to balance projects and their needs and help them meet their deadlines while also keeping the rest of the building tenants happy. It’s a challenge but we always learn things and good things come out of it.
Strategic Communications: For the science people in this roundtable, why is it important for operations partners to understand the work they are supporting?
Ben: I felt it was really important to make the contractor aware of what they were working on, what this facility (NERSC) does and the importance it holds at the federal level for our country. We had partnering sessions, which was a good opportunity to say, “Hey, you’re helping us expand this facility to be even better”. [Facilities teams are] already aware of that, but it’s always helpful to bring everyone together and just get everyone on the same page.
Jeff: I think that it’s sort of a given here at the Laboratory. One of the things that I have found talking to everybody who’s supporting it in one fashion or another is how much people are invested in the fact that they’re helping support outstanding world-class science. I would change the question around a little bit to “How important is it for the Operations folks to understand the, let’s call it the ‘peculiar needs’, of an individual science project?” What is your schedule? What are your cost constraints? How is headquarters looking at this project? We can improve in terms of the communication between operations and a [science] project and engage that upfront.
Strategic Communications: This was a complex, layered project. What did you learn about working together? What advice would you give about partnering operations and science in the future?
David: I’ll echo what others have said. It takes partnering, getting the stakeholders involved and letting them know what’s going to be needed when, and how often. We had a higher number of LOTO [lockout/tagout] requests submitted. If we know that we’re going to have fifty LOTO total to be reviewed, broadcast that early. Say, “Hey, we’re going to start this process roughly in June. So from June through October, we’re going to have a lot of stuff going on”. Try to get everybody on board early.
Rey: You need to be able to adapt, and you need to be able to communicate the issues you’re running into and find solutions. Keeping the ability to adapt and continuing to communicate with the people who are there to support you is your best path forward for any type of major project.
Jeff: One final comment I would just say is the amount of coordination required is way more than you could imagine. One of the strong recommendations is that, going forward, we have a full-time LOTO coordinator on a project with this much electrical work in it. It’s something where it really takes a lot of work to make sure that you have all the moving parts come together and a project is not delayed.
Michael: I’ll throw one more thing in there. Everyone on this call and this project is absolutely among the best people I’ve ever worked alongside. I think it worked because of that, and the high level of understanding, commitment, and teamwork that this team showed.
Strategic Communications: Any final thoughts?
David: I have one. My whole approach has been to try to include everybody who I believe should be included in whatever communications. Dan has a lot on his plate, but he’s been very, very helpful throughout the whole project, as well as Carlos. It’s been a team effort.
Michael: For departments, for people, for processes, this project represented our maximum output over the summer and fall. We were coming off of large projects like a maintenance outage, PSPS [public safety power shutoff], and then testing a system that was unknown to the lab as far as how it functioned. No one’s ever built something like this before. I don’t see how anyone on this call or anyone else involved peripherally could have done more to keep everything going.