In the world of time series analysis, most attention is usually given to filling missing values or modeling trends.
However, one of the most dangerous — and often overlooked — elements is the presence of outliers: extreme values that distort means, inflate variance, and sabotage forecasts.
This article presents a simple, elegant, and effective approach to correcting such anomalies using Python, combining the statistical power of the Z-Score with the smoothness of linear interpolation.
This technique is essential for ensuring robustness before applying models such as ARIMA, Prophet, or LSTM.
Why Use This Technique?

Ignoring outliers in time series can cause false alarms, artificial trends, and inaccurate forecasts. This directly affects financial, logistical, and operational decisions. By using Z-Score, we apply a statistical detection based on standard deviation. Meanwhile, linear interpolation ensures that the correction preserves the coherence of the temporal sequence, without introducing noise or breaking the trend. It's like removing static from an instrument before tuning its melody — a silent but essential preparation for analytical harmony.
This article aims to present a practical and efficient approach to detect and correct such extreme values, using Z-Score as a statistical criterion and linear interpolation as the method to reconstruct the series. The combination of these two techniques restores data integrity with simplicity, preserving trend and reducing noise that distorts forecasting. The content is designed for data scientists, financial analysts, BI professionals, and anyone who works with time series and seeks to make them more reliable for analytical decision-making.
Technical and Mathematical Overview
The Z-Score is a statistical measure that represents the number of standard deviations a value is from the sample mean. The formula:
Z = \frac{x - \mu}{\sigma}
where:
• x: observed value
• μ: series mean
• σ: standard deviation
Values with |Z| > 2.5 (or 3.0, depending on strictness) are treated as outliers. These points are replaced by NaN and later reconstructed using linear interpolation:
x_t = x_{t-1} + \frac{x_{t+1} - x_{t-1}}{2}
This interpolation preserves the natural rhythm of the series, avoiding visual or statistical ruptures.
In other words, the Z-Score (also called the standardized score) quantifies how far each point lies from the series mean in standard-deviation units. In many real-world datasets, |Z| > 2.5 is a practical cutoff for flagging a point as an outlier.
Linear interpolation replaces missing values by connecting the previous and next point with a straight line. This ensures a smooth transition in the curve, respecting the temporal pattern without generating artificial jumps.
This technique is extremely useful in preparing data for sensitive algorithms such as ARIMA, Prophet, and LSTM, as these models assume that the series behaves regularly and continuously over time.
Removing outliers improves accuracy, reduces bias, and avoids false patterns that negatively impact statistical and machine learning modeling.
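Before moving to the full script, here is a minimal, self-contained sketch of both formulas on a toy array (the data and index positions are invented purely for illustration):
# Toy illustration of the Z-Score and midpoint interpolation formulas above
import numpy as np
from scipy.stats import zscore
x = np.array([10, 11, 12, 11, 10, 11, 50, 12, 11, 10, 12, 11], dtype=float)
# Z = (x - mu) / sigma, computed for every point at once
z = zscore(x)
print(np.abs(z) > 2.5)  # only the 50 at index 6 is flagged as an outlier
# x_t = x_{t-1} + (x_{t+1} - x_{t-1}) / 2 -> midpoint between the two neighbors
x_t = x[5] + (x[7] - x[5]) / 2
print(x_t)  # 11.5 would replace the flagged value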
Choosing a Strategy: Manual or Automated?
The manual approach, line by line, is ideal for educational purposes and for maintaining fine control over the treatment of the series. The automated approach, however, can be easily implemented in pipelines, integrating both outlier detection and interpolation functions. In this article, we show both approaches integrated into a single script — ready to run in VSCode, striking the balance between control and productivity.
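As a taste of the automated route, both steps can be wrapped in one reusable function. The sketch below is illustrative; the function name, signature, and default threshold are assumptions, not part of the main script:
# Hedged sketch of a pipeline-ready wrapper (name and defaults are assumptions)
import numpy as np
import pandas as pd
from scipy.stats import zscore
def corrigir_outliers_serie(serie: pd.Series, limite: float = 2.5) -> pd.Series:
    """Flag |Z| > limite as outliers and rebuild them via linear interpolation."""
    z = np.asarray(zscore(serie.values))
    corrigida = serie.copy()
    corrigida[np.abs(z) > limite] = np.nan  # mark outliers as missing
    return corrigida.interpolate(method="linear")
The manual, line-by-line version of exactly these steps follows in the main script.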
Realistic Example: Practical Scenario

Imagine a retail company analyzing its monthly sales. In a particular month, a flash promotion doubled the sales — but the data wasn’t labeled as a special event. This point then behaves as an outlier, negatively influencing future forecasts. Our script simulates exactly this scenario: monthly data with extreme points that need to be smoothed before feeding predictive models.
Complete Python Script with Comments
Below is the robust script, validated in VSCode and with line-by-line technical comments.
It performs the entire process of:
• 📥 Reading historical monthly sales data from a CSV file;
• 🧹 Preparing the time series with monthly frequency and handling missing values;
• 🚨 Automatically detecting outliers using statistical Z-Score;
• 🔄 Correcting extreme values through linear interpolation;
• 📊 Comparative graphical visualization with export of the final image as .png.
⚠️Before running the script, you need to have a file named meus_dados.csv in the same directory.
It should contain the columns Data (monthly dates) and Vendas (numeric values).
Main Script – corrigir_outliers.py
# corrigir_outliers.py
# Author: Izairton Vasconcelos
# Purpose: Detect and correct outliers in time series using Z-Score + Linear Interpolation
# Compatible with Python 3.8+ | Clean code, tested in VSCode, no Pylance warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import zscore
# ============================================================
# Step 1 – Load the dataset (CSV with monthly sales)
# ============================================================
# Expects a file named 'meus_dados.csv' with the columns: 'Data' and 'Vendas'
# The 'Data' column will be used as the time index
df = pd.read_csv("meus_dados.csv", parse_dates=["Data"], index_col="Data")
# ============================================================
# Step 2 – Prepare the time series
# ============================================================
# Convert the series to explicit monthly frequency ("ME" = month-end, pandas 2.2+; use "M" on older versions) and fill missing values with forward fill
serie = df["Vendas"].asfreq("ME")
serie = serie.ffill() # Separated to avoid overload issues in Pylance
# ============================================================
# Step 3 – Calculate Z-Score and identify outliers
# ============================================================
# Convert the series to a NumPy array to avoid static typing errors
z_scores = zscore(serie.values)
z_scores = np.asarray(z_scores) # Ensure explicit type
# Define as outliers the points with absolute score greater than 2.5
outliers = np.abs(z_scores) > 2.5
# ============================================================
# Step 4 – Correct outliers using linear interpolation
# ============================================================
# Create a copy of the original series to preserve data integrity
serie_corrigida = serie.copy()
# Replace outliers with null values (NaN) to be interpolated
serie_corrigida[outliers] = np.nan
# Apply linear interpolation between previous and next points
serie_corrigida = serie_corrigida.interpolate(method="linear")
# ============================================================
# Step 5 – Comparative graphical visualization of the series
# ============================================================
plt.figure(figsize=(12, 6))
# Plot the original series with dashed blue style and markers
plt.plot(serie, label="Original", linestyle='--', marker='o', alpha=0.5)
# Plot the corrected series with solid green line
plt.plot(serie_corrigida, label="Corrected", linestyle='-', color='green')
# Highlight detected outliers with red dots
plt.scatter(serie.index[outliers], serie[outliers], color='red', label="Outliers", zorder=5)
# Set title and axis labels
plt.title("📊 Time Series – Outlier Correction")
plt.xlabel("Date")
plt.ylabel("Sales")
# Enable background grid for easier graph reading
plt.grid(True)
# Display legend with visual elements
plt.legend()
# Auto-adjust layout to prevent clipping
plt.tight_layout()
# Save the graph as PNG image for documentation and reports
plt.savefig("serie_corrigida_outliers.png")
# Display the interactive graph in VSCode or Jupyter environment
plt.show()
Helper Script – gerar_csv_exemplo.py
Run this script once to create the file meus_dados.csv with fictitious data — ideal for testing the above logic.
# gerar_csv_exemplo.py
# Generates a synthetic CSV with monthly sales and two intentional outliers
import pandas as pd
import numpy as np
# Generate 36 monthly dates (Jan/2021 to Dec/2023)
dates = pd.date_range(start="2021-01-01", periods=36, freq="ME")
# Fix the random seed so the generated file is reproducible
np.random.seed(42)
# Create simulated series with mean 210 and random noise
sales = np.random.normal(loc=210, scale=15, size=36).round()
# Insert two outliers (anomalous values)
sales[8] = 300 # anomalous peak
sales[24] = 120 # anomalous drop
# Build the DataFrame and export as CSV
df = pd.DataFrame({"Data": dates, "Vendas": sales})
df.to_csv("meus_dados.csv", index=False)
print("✅ File 'meus_dados.csv' created successfully!")
Generated Images: Visual Analysis and Statistical Interpretation
Graph Visualization

The generated graph provides a clear and educational composition for analyzing a time series with outliers. The dashed blue line represents the original data, including two clearly visible extreme points: an abnormal peak of 300 units in 2021-09 and a sharp drop to around 120 in 2023-01. These values are marked with red dots on the graph, resulting from the statistical application of the Z-Score, using the |Z| > 2.5 criterion.
The "magic" of the script happens when these extreme points are transformed into NaN and smoothed through linear interpolation. As a result, the new solid green line emerges clean, stable, and realistic, flowing smoothly between neighboring points and preserving the series' pattern. This correction is done without breaking the time structure or introducing visual artifacts, reinforcing interpolation as a highly effective technique for preparing time series data with statistical distortions.
Chart Analysis: Outlier Correction in Time Series
1. Dashed Blue Line – Original Series
📌Code line:
plt.plot(serie, label="Original", linestyle='--', marker='o', alpha=0.5)
This line represents the original sales time series, including all values — even the outliers. The blue markers with dashed strokes indicate that it is raw data that has not yet undergone any statistical treatment. Visually, you can notice abrupt oscillations that do not align with the general pattern of the series.
2. Red Dots – Detected Outliers
📌Code line:
plt.scatter(serie.index[outliers], serie[outliers], color='red', label="Outliers", zorder=5)
These dots mark the values considered statistically abnormal, with absolute Z-Score greater than 2.5. In the graph, they are highlighted in red to make anomaly detection easier. They indicate regions where variation was abrupt and unjustified, such as sudden peaks or isolated drops.
3. Solid Green Line – Corrected Series
📌Code involved:
serie_corrigida = serie.copy()
serie_corrigida[outliers] = np.nan
serie_corrigida = serie_corrigida.interpolate(method="linear")
plt.plot(serie_corrigida, label="Corrected", linestyle='-', color='green')
This line shows the smoothed series after outlier removal and interpolation. The linear method ensures continuity and smoothness, creating a more stable curve that aligns with expected behavior. It is ideal for predictive analysis, as it eliminates disruptive points that could harm the modeling process.
4. Title and Axes
📌Code lines:
plt.title("📊Time Series – Outlier Correction")
plt.xlabel("Date")
plt.ylabel("Sales")
The X-axis represents time (monthly dates), while the Y-axis shows the sales values. The title makes it clear that the focus of the visualization is on correcting outliers, visually highlighting the before and after of the smoothing process.
5. Legend
📌Code line:
plt.legend()
The legend facilitates interpretation by distinguishing the three visual layers:
- Dashed blue: original series with outliers;
- Red: statistically detected anomalies;
- Solid green: corrected and interpolated series.
Visual Output and the Magic of the Script
The magic of the script becomes visually evident when the distorted points of the original series are no longer connected by the green curve. Instead of jumps or gaps, we see a line that travels through the empty space left by the outlier, smoothly connecting the previous and next points — that is linear interpolation in action.
The turning point lies in the combination of robust statistics (Z-Score) to find outliers and interpolation engineering that reconstructs them. The graph shows that, by removing the value 300 (positive outlier) and 120 (negative outlier), the new series not only becomes more continuous but also more interpretable from an analytical point of view.
This "qualitative leap" in the curve — from disorder to coherence — is what we call implicit learning of the pipeline. The image does not merely display a corrected series; it illustrates a smarter, more modelable, and reliable one.
Case Study: Connecting Code to Business Practice
Based on the generated graph, it is clear how the original time series displayed abrupt variations, with isolated peaks or drops that could negatively influence any forecasting model. In the simulated company scenario, such outliers might result from data entry errors, system logging failures, or one-time events not explained in the raw dataset. The applied interpolation smooths out these distortions, producing a clean and continuous series — ideal for feeding models like ARIMA, Prophet, or LSTM. The final visualization shows that the correction was performed precisely: extreme points were neutralized and replaced with values that follow the overall trajectory of the curve. This allows models to learn from real and recurring patterns rather than anomalies that distort training. In business environments, such care is vital to ensure reliable forecasts and highly credible management reports.
Line-by-Line Explanation
Each line of the script contributes to preserving the integrity of the analysis. We begin with loading the CSV (pd.read_csv()), where the dataset is imported and the date column is immediately converted to a time index. Then, asfreq("ME") standardizes the frequency to month-end, and .ffill() addresses potential missing data early by carrying the last observation forward. The zscore() function computes the standardized score for each point in the series, allowing for statistical identification of outliers, in this case using a threshold of 2.5 standard deviations. After identifying them via np.abs(z_scores) > 2.5, we replace those values with np.nan, preserving the structure of the time series. Interpolation is then performed using interpolate(method="linear"), connecting previous and next values smoothly. Finally, matplotlib.pyplot is used to generate a detailed chart, with clearly defined visual layers and image export. The entire flow is cohesive, modular, and ready for integration into real-world projects.
Strategic Interpretation
When observing the corrected time series, there is a clear reduction in noise and smoother progression over time. The removal of outliers prevents analytical models from learning unrealistic behaviors such as sporadic peaks or artificial drops. This statistical cleansing is crucial in business contexts, where decisions rely on accurate forecasts. Without such treatment, the model may exaggerate projections or signal non-existent risks. Linear interpolation, besides preserving signal continuity, avoids discarding entire data points, thus maximizing the value of each observation. The strategy is effective because it combines statistical precision (Z-Score) with analytical intelligence (interpolation), forming a robust and replicable preprocessing pipeline.
Real-World Applications in Business and AI

This technique is widely applicable to real-world scenarios. In sales and marketing, it can correct distortions caused by one-off promotional events. In logistics, it removes delivery spikes caused by operational failures. In finance, it ensures that atypical dividends don’t distort profit series. In manufacturing, it corrects inaccurate production records or IoT sensor failures. In AI-driven data science, it enhances the reliability of datasets used to train recurrent neural networks (RNNs), such as LSTM or GRU. In MLOps pipelines, outlier treatment is an essential preprocessing step, ensuring that model inputs are clean and that predictions are based on trustworthy patterns.
Advanced Insights and Strategic Suggestions
For larger projects, it is recommended to encapsulate this entire process into customized functions and integrate it into a full transformation pipeline using libraries such as sklearn.pipeline or dataflow frameworks like Prefect, Dagster, or Airflow. It is also possible to dynamically adjust the Z-Score threshold based on local variance using rolling windows. The interpolation process can be enhanced with more sophisticated methods such as spline interpolation, polynomial fitting, or local regression (LOESS). In highly regulated environments such as banking and insurance, it is advisable to log all removed points and generate detailed audit trails explaining the correction logic. Additionally, this technique can be combined with clustering models to differentiate between structural outliers (regime shifts) and purely statistical anomalies (isolated points).
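As one concrete illustration of the rolling-window idea, a local Z-Score can be computed against a moving mean and standard deviation. This is a hedged sketch, assuming a centered 12-month window; the function name, window size, and threshold are illustrative choices, not prescriptions:
# Hedged sketch: local Z-Score over a centered rolling window (assumed window=12)
import pandas as pd
def outliers_zscore_movel(serie: pd.Series, janela: int = 12, limite: float = 2.5) -> pd.Series:
    """Return a boolean mask of points far from their local rolling mean."""
    media = serie.rolling(janela, center=True, min_periods=3).mean()
    desvio = serie.rolling(janela, center=True, min_periods=3).std()
    z_local = (serie - media) / desvio
    return z_local.abs() > limite  # True where the point is locally extreme
The resulting mask plugs into the same NaN-and-interpolate step used earlier; for the reconstruction itself, pandas also exposes serie.interpolate(method="spline", order=3) when SciPy is installed.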
Technical Conclusion

Handling outliers in time series is more than just a good practice — it is a fundamental requirement for building solid predictive models. The combined technique of Z-Score detection and linear interpolation enables the identification of extreme values based on rigorous statistical thresholds and replaces them naturally, preserving the flow and continuity of the series. The result is a dataset that is cleaner, more coherent, and more stable — all essential characteristics for serious analysis. This approach is simple enough to be deployed in small-scale experiments and robust enough to scale in enterprise-level environments. By applying this method, the data professional demonstrates mastery of statistical foundations, accountability in data handling, and practical awareness of modeling needs.
References and Further Reading
- SciPy Documentation – Z-Score
Official documentation of the scipy.stats.zscore() function, used to standardize data and detect outliers based on standard deviation.
🔗https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.zscore.html
- Pandas Documentation – Interpolation
Reference for the .interpolate() method, which provides linear and advanced interpolation methods to fill missing or altered time series data.
🔗https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html
- Forecasting: Principles and Practice – Hyndman & Athanasopoulos
Renowned textbook on time series forecasting, with dedicated sections on dealing with anomalies and data quality issues.
🔗https://otexts.com/fpp3/
- Kaggle – Datasets with Outliers in Time Series
Community-contributed datasets for experimentation, training, and validation of time series cleaning pipelines involving outlier detection.
🔗https://www.kaggle.com/datasets?search=outliers+time+series
Follow & Connect
Izairton Vasconcelos is a technical content creator with degrees in Software Engineering, Business Administration, and Statistics, as well as several specializations in Technology.
He is a Computer Science student and Python specialist focused on productivity, automation, finance, and data science. He develops scripts, dashboards, predictive models, and applications, and also provides consulting services for companies and professionals seeking to implement smart digital solutions in their businesses.
Currently, he publishes bilingual articles and tips in his LinkedIn Newsletter, helping people from different fields apply Python in a practical, fast, and scalable way to their daily tasks.
💼 LinkedIn & Newsletters:
👉https://www.linkedin.com/in/izairton-oliveira-de-vasconcelos-a1916351/
👉https://www.linkedin.com/newsletters/scripts-em-python-produtividad-7287106727202742273
👉https://www.linkedin.com/build-relation/newsletter-follow?entityUrn=7319069038595268608
💼 Company Page:
👉https://www.linkedin.com/company/106356348/
💻 GitHub:
👉https://github.com/IOVASCON