R studio Machine learning


Book Chapters for Ideas/About-the-Editors_2018_Big-Data-Application-in-Power-Systems.pdf

About the Editors

Reza Arghandeh is an Assistant Professor in the ECE Department at Florida State University, where he directs the Collaborative Intelligent Infrastructure Lab. He was a postdoctoral scholar at the University of California, Berkeley’s California Institute for Energy and Environment from 2013 to 2015, and he has 5 years of industrial experience in power and energy systems. He completed his PhD in Electrical Engineering with a specialization in power systems at Virginia Tech. He holds Master’s degrees in Industrial and System Engineering from Virginia Tech (2013) and in Energy Systems from the University of Manchester (2008). From 2011 to 2013, he was a power system software designer at Electrical Distribution Design Inc. in Virginia. Dr. Arghandeh’s research interests include, but are not limited to, data analysis and decision support for smart grids and smart cities using statistical inference, machine learning, information theory, and operations research. He is a recipient of the Association of Energy Engineers (AEE) Scholarship (2012), the UC Davis Green Tech Fellowship (2011), and best paper awards from the ASME 2012 Power Conference and the IEEE PESGM 2015. He is the chair of the IEEE Task Force on Big Data Application for Power Distribution Network.

Yuxun Zhou is currently a PhD candidate in the Department of EECS, UC Berkeley. Prior to that, he obtained the Diplome d’Ingenieur in applied mathematics from Ecole Centrale Paris and a BS degree from Xi’an Jiaotong University. Yuxun has published more than 30 refereed articles and has received several student awards. His research interests are machine learning theories and algorithms for modern sensor-rich, ubiquitously connected cyber-physical systems, including smart grids, power distribution networks, and smart buildings.



Book Chapters for Ideas/Acknowledgments_2018_Big-Data-Application-in-Power-Systems.pdf


The idea for this book goes back a few years, to when we were analyzing smart meter and SCADA data from some Californian electric utilities using different machine learning and statistical inference methods. Later on, we started to work on phasor measurement unit (PMU) and micro-PMU data streams, which have much higher resolution than smart meter data. The PMU and power quality recording data (120 Hz to 30 kHz and beyond), plus highly spatially distributed data from smart meters, marked the advent of big data in power systems.

Utilities are already dealing with big data challenges, given the lack of expertise in the workforce and the lack of suitable infrastructure to handle and process the massive data. We are sure that some of our readers have had a similar experience. On top of that, in the near future every house may have rooftop solar panels, controllable loads, smart appliances, electric vehicles, and various software-enabled hardware that will be more connected in the era of the Internet of Things.

This book is a step toward data-driven utilities by presenting a combination of the high-level view on utility enterprise architecture, data analysis methodology, and various applications of data analytics in power transmission and distribution networks.

We have been lucky enough to have great maestros in our lives: our parents, Ali & Soodabeh Arghandeh and Yanping & Suxue Zhou, and our advisers, Prof. Robert Broadwater and Prof. Saifur Rahman at Virginia Tech and Prof. Costas Spanos and Prof. Alexandra von Meier at UC Berkeley.

In this book, we have a collection of highly recognized experts in academia

and industry in the field of power systems and data analysis from all around the world. We would like to thank them all for their outstanding contributions.

We would like to thank Dr. Heather Paudler for her valuable input on the

book. We extend special thanks to Renata R. Rodrigues and Ana C. A. Garcia from the Elsevier editorial team for their tireless help and advice during

the different stages of preparation for this book. We also appreciate Honoka


xxvi Acknowledgments

Hamano’s efforts in designing the book cover, icons for each section, and

various other creative graphics inside the book.

Finally, we would like to thank several reviewers for valuable comments on

preliminary drafts of this book: Jeffrey S. Katz, Ricardo Bessa, John D. McDonald, Carol L. Stimmel, Mohammad Babakmehr, Elena Mocanu, Madeleine Gibescu,

Mehrdad Majidi, Gian Antonio Susto, Deepjyoti Deka, Fabio Rinaldi, Feng Gao,

Han Zou, Ming Jin, Ruoxi Jia, Yingchen Zhang, Behzad Najafi, Amin Hassanzadeh, Mihye Ahn, Hanif Livani, Matthias Stifter, Saverio Bolognani,

Michael Chertkov, Amirhessam Tahmassebi, Madhavi Konila Sriram, Roy Dong,

and Jose Cordova.

We look forward to hearing from our readership; please contact us with any

comments, suggestions, and questions.

Reza Arghandeh Florida State University, Tallahassee, FL, United States

Yuxun Zhou University of California, Berkeley, CA, United States


Book Chapters for Ideas/Chapter-10---Future-Trends-for-Big-Data-Applic_2018_Big-Data-Application-in-.pdf


Future Trends for Big Data Application in Power Systems

Ricardo J. Bessa INESC Technology and Science—INESC TEC, Porto, Portugal



The technological revolution in the electric power system sector is producing large volumes of data with a relevant impact on the business and functional processes of system operators, generation companies, and grid users. Big data techniques can be applied to state estimation, forecasting, and control problems, as well as to support the participation of market agents in the electricity market. This chapter presents a review of the application of data mining techniques to these problems. Trends like feature extraction/reduction and distributed learning are identified and discussed. The knowledge extracted from power system and market data has a significant impact on key performance indicators, like operational efficiency (e.g., operating expenses), investment deferral, and quality of supply. Furthermore, business models related to big data processing and mining are emerging and boosting new energy services.


The advent of Smart Grids with advances in information and communication technologies (ICT) and installation of new measurement devices, such as phasor measurement units (PMU) and remote terminal units (RTU) in secondary substations (MV/LV), allied to additional information collected by SCADA, will generate a large volume of data streams.

Equipment installed in MV/LV substations collects imported/exported active power, voltage magnitude, and reactive power in four quadrants, and a distribution system operator (DSO) can easily operate more than 10,000 secondary substations. In HV/MV substations, which can number more than 1000 in one DSO, additional data is collected through the SCADA, such as current, active and reactive power flow in the network feeders, switch and capacitor bank status, as well as variables related to electric transformers (e.g., input/output voltage, temperature, tap changer position, transformer oil level, insulation level of transformer oil, load). This high volume of grid data has different constraints in terms of communication latency and availability. For instance, significant technical and economic constraints are expected in the real-time communication between smart meters and secondary substations, which requires new

Big Data Application in Power Systems. https://doi.org/10.1016/B978-0-12-811968-6.00010-3

Copyright © 2018 Elsevier Inc. All rights reserved.

224 CHAPTER 10: Future Trends for Big Data Application in Power Systems

approaches for the real-time monitoring of low voltage (LV) networks. Moreover, the time resolution of the data collected by different equipment differs: PMUs collect high-frequency data, while RTUs, in general, collect low-frequency data (e.g., 15-min averages).

PMUs can provide data at a high update rate to a transmission system operator (TSO). For instance, the Texas Synchrophasor Network collects 30 measurements per second from each PMU (e.g., voltage/current magnitude and phase, frequency), which means 108,000 lines of comma-separated data per hour and 2.6 million lines for a 24-h period; for 15 PMUs, file storage is about 1 GB per day [1].
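The figures above follow from simple arithmetic, which the short sketch below reproduces. The constant names are illustrative, and reading "1 GB" as 10^9 bytes is an assumption made here, so the implied average record size is only indicative.

```python
# Back-of-the-envelope check of the PMU data volumes quoted in the text.
REPORTS_PER_SECOND = 30        # reporting rate per PMU (from the text)
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24
N_PMUS = 15                    # PMUs in the storage example (from the text)
DAILY_STORAGE_BYTES = 1e9      # "about 1 GB per day", read as 10^9 bytes (assumption)

lines_per_hour = REPORTS_PER_SECOND * SECONDS_PER_HOUR
lines_per_day = lines_per_hour * HOURS_PER_DAY
total_lines_per_day = lines_per_day * N_PMUS

# Average size of one comma-separated line implied by the storage figure.
bytes_per_line = DAILY_STORAGE_BYTES / total_lines_per_day

print(lines_per_hour)          # 108000
print(lines_per_day)           # 2592000, i.e., about 2.6 million
print(round(bytes_per_line))   # 26
```

Under these assumptions, the quoted storage corresponds to roughly 26 bytes per line on average.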

This data, collected at different voltage levels, is essential to revisit classical TSO and DSO grid management functions, such as forecasting, state estimation, and operational planning, to develop new tools that increase the real-time awareness of operators, and to design predictive maintenance strategies for network components.

The renewable energy sources (RES) industry is also installing and operating monitoring sensors at the wind turbine and photovoltaic panel level, which generates a large volume of data that needs to be preprocessed and analyzed in real time and transferred to upstream decision centers. For instance, a 2.5 MW wind turbine has more than 120 sensors inside the rotor, the generator, and on the blades, which gather 10,000 data points every second. They feed the information to a remote database, which stores 4 TB from 25,000 turbines around the world (footnote 1). The same is valid for a gas turbine engine, which generates 520 GB per day, in contrast to Twitter, where a day of real-time feeds represents around 80 GB (footnote 2). This data can be used for reliability and performance monitoring, predictive maintenance, and asset management of conventional and RES power plants. Eventually, the outcome of the data analysis at the power plant level can feed power system reliability assessment tools [2] by providing, for instance, data-driven time-varying failure rates.

In addition to all these electrical and mechanical variables, there are also exogenous variables with significant impact on the power system and power plants operation and planning, such as measured and predicted weather variables (e.g., wind speed, temperature, and solar irradiance) that can form a grid of spatial-temporal weather information in a region and/or country.

Electricity markets are already generating large volumes of data, like offer curves (per unit) in different sessions, energy and ancillary services prices, as

1 Source: http://www.gereports.com/post/118712460090/move-over-slow-food-slow-wind-might-be-the-latest/ (accessed in October 2016).

2 Source: http://www.computerweekly.com/news/2240176248/GE-uses-big-data-to-power-machine-services-business (accessed in October 2016).

2252 Transmission System

well as locational marginal prices (LMP) for each node of the transmission network. The foreseen creation of flexibility markets at the distribution level will increase the volume of data and its spatial scale. The planned investment in interconnection capacity between different control areas, and the increasing integration of RES in power systems with LMP, make spatial-temporal modeling of large-scale time series vital for operational and planning purposes. Therefore, knowledge extraction from big data can create additional value for both market players and system operators.

All these problems require different layers of data handling: (i) data acquisition and transmission; (ii) data management (e.g., frameworks like Hadoop or Spark); (iii) data analytics, which can comprise knowledge extraction from data, optimization, and decision-aid methods. The first two layers have already achieved a high technology readiness level, with different solutions available in the market [3,4]. However, standardization of the data model, ICT for real-time data transmission, and cybersecurity issues remain areas needing significant improvement.

The scope of this chapter is the big data analytics layer, and the overall objective is to discuss the main challenges related to knowledge extraction in different power system-related problems and cover new (and evolving) problems, such as distributed learning and optimization, spatial-temporal modeling of time series, data reduction, assimilation, and visualization methods. The entire electric power system is covered, going from Extra HV to LV, without overlooking the wholesale and retail electricity markets.

This chapter is organized as follows: Section 2 describes the data-driven techniques for dynamic and steady-state analysis of transmission systems, as well as the interaction between transmission and distribution system operators; in Section 3, the additional monitoring and control capabilities provided by advanced data mining techniques are discussed in a Smart Grids context; Section 4 discusses the knowledge extraction from failure data to support asset management strategies of system operators and generation companies; the added value of big data techniques for electricity market bidding and simulation is discussed in Section 5, while Section 6 discusses its application to boost demand-side flexibility. The conclusions are presented in Section 7.


At the transmission system level, the increasing penetration of RES is demanding new monitoring and management tools for both interconnected and isolated systems. A new generation of decision-aid tools will supply the operator with valuable information to check the security level of the economic dispatch and/or electricity market-clearing, considering RES variability and


uncertainty, as well as to increase the real-time awareness and derive recommendations to support preventive decisions.

2.1 Dynamic Behavior Analysis

The installation of PMUs at different voltage levels generates important information to warn operators and system-level controllers about impending transient stability issues, support their preventive decisions, and perform postmortem analysis. The California Independent System Operator (CAISO) defined use cases that describe the inclusion of PMU data for grid operations, control, and modeling tasks [5]. The use cases identified seven scenarios to demonstrate the value of PMU data:

1. The PMU network triggers an alarm (e.g., rate of frequency change, modes of oscillation, rate of damping) for a recommendation system that generates a set of control actions for the operator.

2. Measure the frequency difference between main and isolated grids for system restoration after a disturbance and determine how much generation must be changed to reconnect the separated grids.

3. Postmortem analysis of system events to understand the causes of a disturbance, which is used to validate offline dynamic models and contingency simulation tools.

4. Validation of grid code and market models for new types of resources, such as RES and storage.

5. Detect transient instability and derive preventive control actions that can respond to specific or wide-area grid problems, e.g., angular and voltage stability, low-frequency oscillations.

6. Identify poorly damped interarea oscillations and design smart control actions to mitigate the oscillations, e.g., use PMU data to tune power system stabilizers.

7. Increase the line rating of transmission lines in real time. The PMU data can detect postcontingency technical problems and activate the preventive control actions from scenario (5) to mitigate the violations in real time by reconfiguring the system (e.g., increase generation or decrease load).

The Electric Power Research Institute (EPRI) identifies the following applications for PMU data [6]: (i) improvement of state estimation; (ii) oscillation detection and control; (iii) voltage stability monitoring and control; (iv) load model validation; (v) system restoration and event analysis.

It should be stressed that the use of PMU data demands a portfolio of different tools at the control center level, which corresponds to the enhancement of classical functions and to the development of new functions. Examples of related tools are the state estimator, voltage stability analysis, volt/Var control, and RES dispatch. A PMU network combined with decision trees can be used to


match generator trip signatures with the overall system dynamics, aiming to find the most likely location of an event in real time [7]. The data processing and machine learning model fitting were performed offline and in a controlled environment, since the training set consisted of 53 events that match known generator trips. Industrializing this solution would require machine learning algorithms for classification problems that can cope with high-speed data streams and detect concept drift [8].

Other potential applications are: (i) line trip detection, which requires postprocessing methods such as a low-pass filter to remove high-frequency noise and a second filter to extract the trend of the frequency data [9]; (ii) online prediction of transient stability (i.e., three-phase faults at different buses) with a decision tree algorithm in order to derive corrective control rules [10].
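The two-filter idea can be sketched as follows. This is a minimal illustration on synthetic data, not the method of [9]: the moving-average filters, the window lengths, the 0.02 Hz threshold, and the function names are all assumptions chosen for the example. A short filter removes noise, a long filter tracks the trend, and an event is flagged when the two disagree.

```python
import random

def moving_average(signal, window):
    """Causal moving-average low-pass filter (uses a shorter history at the start)."""
    out = []
    running = 0.0
    for i, value in enumerate(signal):
        running += value
        if i >= window:
            running -= signal[i - window]   # drop the sample leaving the window
        out.append(running / min(i + 1, window))
    return out

def detect_events(freq, short_window=5, long_window=60, threshold=0.02):
    """Return sample indices where the denoised signal departs from the slow trend."""
    denoised = moving_average(freq, short_window)   # first filter: remove noise
    trend = moving_average(freq, long_window)       # second filter: slow trend
    return [i for i, (d, t) in enumerate(zip(denoised, trend))
            if abs(d - t) > threshold]

random.seed(0)
# Synthetic 60 Hz frequency trace sampled 30 times per second (hypothetical data):
# small measurement noise plus a 0.05 Hz step drop starting at sample 300.
freq = [60.0 + random.gauss(0.0, 0.003) - (0.05 if k >= 300 else 0.0)
        for k in range(600)]

events = detect_events(freq)
print(events[0], events[-1])   # first and last flagged sample indices
```

The flagged window starts within a few samples of the step and ends once the long filter has caught up, which is why the choice of the two window lengths matters in practice.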

The seamless integration of PMU data into power system operational tools will require a data analytics platform that integrates batch, real-time, and iterative data processing. Apache Spark is emerging as the cluster computing platform for future power systems [11]. The trend is toward distributed computing for data collection and analytics. However, there is a need to develop parallelizable algorithms that distribute the computational load across multiple nodes [12].

Furthermore, this efficient computational framework does not remove the need for data reduction and compression techniques, which should adapt to the different operating conditions, e.g., compress less data under disturbance conditions [13]. Classical techniques, such as principal component analysis and the discrete wavelet transform, can be extended to this problem to have time-varying (potentially combined with change detection) and situation-dependent characteristics. Clustering algorithms can also be used to group the dynamic response of generators (i.e., transient responses of generator rotor angles), and a classification algorithm can then forecast the dynamic signature of a system using a dataset of postdisturbance responses [14].

Failures in communication create missing values in the power system dynamic response. The state of the art consists of using the linear auto-regressive with exogenous input model to estimate system dynamics, together with an input location selection methodology based on a coherency function [15]. The spatial and temporal dependencies between the system variables can be further exploited with the different families of covariance functions associated with Gaussian process theory to improve the estimation of missing values [16].
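To make the auto-regressive with exogenous input (ARX) idea concrete, the sketch below fits a first-order model y[t] = a*y[t-1] + b*u[t-1] by least squares and uses it to fill a gap. It is a toy under stated assumptions: the model order, the synthetic noise-free signals, and the function names are hypothetical, and the input location selection step of [15] is not implemented.

```python
import math

def fit_arx(y, u):
    """Least-squares fit of y[t] = a*y[t-1] + b*u[t-1], skipping missing samples (None)."""
    s_yy = s_yu = s_uu = r_y = r_u = 0.0
    for t in range(1, len(y)):
        if y[t] is None or y[t - 1] is None:
            continue
        s_yy += y[t - 1] * y[t - 1]
        s_yu += y[t - 1] * u[t - 1]
        s_uu += u[t - 1] * u[t - 1]
        r_y += y[t - 1] * y[t]
        r_u += u[t - 1] * y[t]
    # Solve the 2x2 normal equations with Cramer's rule.
    det = s_yy * s_uu - s_yu * s_yu
    a = (r_y * s_uu - r_u * s_yu) / det
    b = (r_u * s_yy - r_y * s_yu) / det
    return a, b

def fill_missing(y, u, a, b):
    """Replace missing samples by one-step ARX predictions (assumes y[0] is known)."""
    filled = list(y)
    for t in range(1, len(filled)):
        if filled[t] is None:
            filled[t] = a * filled[t - 1] + b * u[t - 1]
    return filled

# Synthetic system with known coefficients and a measured input u (hypothetical data).
A_TRUE, B_TRUE = 0.8, 0.5
u = [math.sin(0.3 * t) + 0.5 * math.sin(1.1 * t) for t in range(100)]
y_true = [0.0]
for t in range(1, 100):
    y_true.append(A_TRUE * y_true[t - 1] + B_TRUE * u[t - 1])

# Simulate a communication failure that drops five consecutive samples.
y_obs = list(y_true)
for t in range(40, 45):
    y_obs[t] = None

a_hat, b_hat = fit_arx(y_obs, u)
filled = fill_missing(y_obs, u, a_hat, b_hat)
max_error = max(abs(f - v) for f, v in zip(filled, y_true))
print(a_hat, b_hat, max_error)
```

Because the synthetic data is noise-free, the coefficients and the gap are recovered almost exactly; with real measurements the reconstruction error grows with the length of the gap.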

Machine learning algorithms can also be used to give a real-time quantitative security evaluation of the system’s current operating state (i.e., expected frequency deviation) based on historical states and observations of the power system variables [17]. This research line was further explored in microgrids and isolated systems [18].


2.2 Steady-State Analysis

The tools for steady-state analysis of power systems, such as power flow and state estimation algorithms, have reached a high technology readiness level, and several commercial solutions are already available. The current challenge is to integrate new and diverse types of information in these classical algorithms and capture the spatial-temporal dependency structure of the variables, while guaranteeing a high


Past developments in state estimation algorithms already included information from load forecasts to predict the future states of the power system, for instance, by modeling the dependency between nodal injection forecast errors with a covariance matrix [19,20]. Load forecasting and state estimation theories can be merged to forecast the future values of the power system state variables (bus voltage magnitude and phase) and then calculate the load values as a function of the state parameters [21]. This new load forecast paradigm enables the use of additional data, such as voltage phase from PMUs or electrical variables collected from multiarea networks, and the construction of local forecast models for different subnetworks.

However, the modeling of spatial-temporal dependencies is indispensable and requires a method suitable for large-scale implementation. Gaussian copulas can be employed to model the spatial-temporal dependency structure between random variables [22], but they have two limitations: (i) lack of flexibility in modeling different types of tail dependency; (ii) low scalability when the number of random variables increases.
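A two-variable Gaussian copula can be sketched in a few lines: draw correlated standard normal variables and push each through the standard normal CDF to obtain uniform marginals that keep the dependence. The correlation value, the sample size, and the function names below are assumptions for illustration and are not taken from [22].

```python
import math
import random

def std_normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sample_gaussian_copula(rho, n, rng):
    """Draw n pairs (u1, u2) with uniform marginals and Gaussian dependence rho."""
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second normal variable with correlation rho to the first.
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        pairs.append((std_normal_cdf(z1), std_normal_cdf(z2)))
    return pairs

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

RHO = 0.7                       # assumed dependence between two forecast errors
rng = random.Random(42)
samples = sample_gaussian_copula(RHO, 20000, rng)

# For a Gaussian copula, the correlation of the uniform marginals
# (Spearman's rho) equals (6/pi) * asin(rho/2).
rank_corr = pearson([s[0] for s in samples], [s[1] for s in samples])
expected = 6.0 / math.pi * math.asin(RHO / 2.0)
print(round(rank_corr, 2), round(expected, 2))
```

The uniform samples can then be mapped to any marginal distribution through its inverse CDF. The sketch also hints at limitation (ii): with d variables, the copula needs a full d-by-d correlation matrix.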

The effect of RES and load uncertainty (and variability) in state estimation, together with frequent topological changes, leads to significant state shifts in power system operation. This problem can be mitigated by developing data-driven solutions instead of relying on a single data point (the last state estimate). Kernel ridge regression with a Bayesian framework that uses historical data collected by the energy management system can tackle this problem [23].

Another relevant trend is the use of distributed learning approaches for robust state estimation that results in minimum data exchanges between neighboring areas [24], mitigates privacy issues, and can run locally in grid equipment. This distributed learning paradigm relies on the alternating direction method of multipliers (ADMM), which combines the decomposability offered by the dual ascent method with the superior convergence properties of the method of multipliers; this means that problems with nondifferentiable objective functions can be easily addressed and that parallel optimization is possible [25]. It is also possible to apply other variants, such as the Douglas-Rachford and block coordinate descent methods [26,27]. It is important to stress the nonlinear nature of the AC power system, which results in a nonconvex problem for the state estimator.
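The consensus form of ADMM can be sketched on a deliberately simple problem. Assume each area holds one scalar measurement of a shared quantity and the areas jointly minimize a convex least-squares objective; this toy setting, the measurement values, and the function name are assumptions for illustration, and it is not the robust, nonconvex AC state estimator of [24].

```python
def consensus_admm(local_measurements, rho=1.0, iterations=100):
    """Consensus ADMM for min_x sum_i 0.5 * (x - a_i)^2, with one a_i per area."""
    n = len(local_measurements)
    x = [0.0] * n    # local estimates, one per area
    u = [0.0] * n    # scaled dual variables, one per area
    z = 0.0          # shared consensus value
    for _ in range(iterations):
        # Local step: each area solves its own small problem (parallelizable).
        x = [(a + rho * (z - ui)) / (1.0 + rho)
             for a, ui in zip(local_measurements, u)]
        # Coordination step: only x_i + u_i is exchanged, never the raw data.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: each area penalizes its disagreement with the consensus.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z, x

# Hypothetical per-unit voltage measurements held by four neighboring areas.
measurements = [1.02, 0.98, 1.05, 0.97]
z, x_local = consensus_admm(measurements)
print(round(z, 6))   # 1.005, the average of the four measurements
```

For this objective the consensus value converges to the average of the local measurements, so the result is easy to check. The structure (local solve, small exchange, dual update) is what carries over to larger problems.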


The same paradigm can be applied to RES forecasting to exploit geographically distributed time-series information [28]. The vector autoregression (VAR) framework can be applied to forecast thousands of time series in a distributed fashion by combining ADMM with the LASSO framework to exploit the sparsity in the model’s coefficients.

The practical implementation of the distributed learning paradigm requires an adequate choice of the distributed processing platform, which can be divided into two types [29]: (i) horizontal scaling, which distributes the workload across several servers (decentralized and distributed cluster or cloud computing frameworks); (ii) vertical scaling, which involves installing more processors, memory, and faster hardware inside a single machine.

For horizontal scaling, message passing interface (MPI) was the first communication protocol to distribute and exchange the data between peers; Apache Hadoop, with MapReduce as the data processing scheme, emerged later; and Apache Spark is now the prevalent solution. For iterative algorithms like ADMM, MapReduce is not adequate due to disk I/O limitations, while Spark performs in-memory computations that overcome these limitations for iterative processes [29]. The most popular vertical scale-up technologies are high-performance computing clusters, multicore processors, and graphics processing units (GPU). The ADMM algorithm and variants can be implemented in these


2.3 TSO-DSO Cooperation

The data exchange between TSO and DSO will contribute to increasing the security of both systems in different time scales, ranging from real time to long-term planning. The European project evolvDSO developed a use case for TSO-DSO cooperation, which firstly means bidirectional exchange of information, both historical and real-time data, regarding the operating conditions of the transmission and distribution systems [30]. Secondly, it can also mean the DSO supporting the TSO operational and planning tasks, for instance, by controlling the active and reactive power in the primary substation or elaborating a joint expansion plan of both systems. Cooperation is needed since presently the distribution system is a black box to the TSO and vice versa. Moreover, considering the increasing integration of distributed energy resources in the distribution system, the operation of both networks becomes challenging and cannot be decoupled. The new flexible resources (e.g., demand response, DR) are also at the distribution system level, which requires new TSO-DSO technical protocols for their activation and management.

This increasing cooperation will mean additional data to be integrated and explored in the management tasks of both TSO and DSO. One trend is the


development of tools capable of estimating the flexibility range of active and

reactive power in the TSO-DSO boundary and separating this flexibility by total cost [31]. The same exercise can be conducted for lower voltage levels

of the power system [32].

For dynamic analysis, the trend is to estimate the dynamic response of load aggregated at the network node level for a time domain between one and several seconds. One example is probabilistic methodologies based on processing and classifying large amounts of historical load data at each bus and standard dynamic signatures of individual load categories obtained from laboratory/field tests [33]. Another is dynamic equivalent models constructed for the distribution networks that are able to reflect the aggregated behavior of different resources with respect to system requirements such as frequency containment reserve. Machine learning algorithms, such as artificial neural networks, can be used as surrogate models for the dynamic equivalents [34].


The big data trends in the distribution system are mainly driven by two objectives. Firstly, increase the monitoring capability of MV and LV networks and develop fast decision-aid methods for operators. Secondly, implement predictive active management strategies that take advantage of flexibility from distributed energy resources to mitigate the impact of RES uncertainty and


3.1 Monitoring and Situational Awareness

The smart grid paradigm increases the monitoring capability of the distribution system. However, it might be unmanageable to have real-time monitoring of all the devices in the distribution system, particularly at the LV level. Machine learning algorithms installed in intelligent electronic devices can support power system monitoring by providing several functionalities, such as reconstruction of missing signals, state estimation, asset monitoring and diagnosing, and fault location. These functions should have low computational requirements (e.g., no need to store data, capacity of running in low-cost processors) and the possibility to adjust under evolving conditions.

For LV grids, the trend is to explore data collected from smart meters and RTU installed in MV/LV substations for close to real-time situational awareness of

operators and with low communication costs. Smart meter data can be used

to increase the knowledge about the LV network topology and characteristics. For instance, it can be used to reduce geographical information system errors

(e.g., connectivity errors in the network topology) and for phase detection [35].

2313 Distribution System

Data-driven methods, such as autoencoder extreme learning machines (AE-ELM), can be employed to estimate, close to real time, voltage magnitude and active power for all nodes of the LV network by using only a subset of meters with real-time communication capability [36,37]. This new smart grid function can generate under/overvoltage alarms for operators and trigger control management functions to solve the technical problems. These techniques provide accurate information about voltage magnitude. With only 30% of the total meters having real-time communication, the AE-ELM state estimator estimates [38]: (i) voltage magnitude values with a mean absolute error (MAE) of 0.49 V; (ii) active power quantities with an MAE of 0.35 kW. The largest MAE was 0.79 V.
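For reference, the MAE metric quoted above is the average of the absolute differences between measured and estimated values. The sketch below computes it for a handful of made-up voltage readings; the numbers and the function name are purely illustrative and are not results from [38].

```python
def mean_absolute_error(actual, estimated):
    """Average absolute difference between paired actual and estimated values."""
    if len(actual) != len(estimated) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)

# Hypothetical voltage magnitudes (V) at four LV nodes: metered vs. estimated.
measured_v = [230.1, 229.4, 231.0, 228.7]
estimated_v = [229.6, 229.9, 230.4, 229.3]

mae_v = mean_absolute_error(measured_v, estimated_v)
print(round(mae_v, 2))   # 0.55
```

Because the MAE is expressed in the unit of the variable (V or kW), it can be compared directly with operating limits such as the admissible voltage band.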

The challenge is how to monitor the operating conditions of multiple LV networks at the same time and derive control strategies to solve detected technical problems. This problem requires new techniques for data streaming visualization and dimension reduction that summarize the operating conditions of each …