AI & Machine Learning

Sentiment Analysis of News Articles: Building an ML Pipeline

BeyondScale Team

AI/ML Team

January 7, 2026 · 9 min read

Sentiment analysis of news articles enables organizations to track public perception, monitor brand reputation, and identify emerging trends. This guide walks through building a production-ready sentiment analysis pipeline.

> Key Takeaways
>
> - Pre-trained transformer models like DistilBERT and FinBERT provide strong baselines for news sentiment with 85-95% accuracy
> - Entity-level sentiment analysis lets you attribute positive or negative sentiment to specific companies, people, or topics within an article
> - A complete pipeline includes data collection, preprocessing, model inference, aggregation, and visualization
> - Serverless deployment on AWS Lambda enables cost-effective, scalable real-time sentiment processing

What Are the Key Use Cases for News Sentiment Analysis?

News sentiment analysis uses NLP to classify articles as positive, negative, or neutral, enabling use cases from brand monitoring and market intelligence to crisis detection and competitive analysis.
  • Brand Monitoring: Track how your company is portrayed in media
  • Market Intelligence: Gauge market sentiment for investment decisions
  • Public Relations: Monitor crisis situations in real-time
  • Competitive Analysis: Understand perception of competitors
  • Trend Detection: Identify emerging topics and sentiments
According to Grand View Research, the global sentiment analytics market was valued at $4.4 billion in 2024 and is projected to grow at a CAGR of 14.2% through 2030, driven by increasing demand for real-time media monitoring and customer insight. Organizations across finance, healthcare, and technology increasingly rely on custom AI development to build sentiment pipelines tailored to their specific domains.

Pipeline Architecture

News Sources
    │
    ▼
┌─────────────────┐
│ Data Collection │
│  (RSS, APIs)    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Text Extraction │
│   & Cleaning    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Sentiment    │
│    Analysis     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Aggregation   │
│   & Storage     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Visualization  │
│   & Alerting    │
└─────────────────┘

Data Collection

News API Integration

import requests
from datetime import datetime, timedelta

class NewsCollector:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://newsapi.org/v2"

    def get_articles(self, query, days_back=7):
        """Fetch news articles for a query."""
        from_date = (datetime.now() - timedelta(days=days_back)).strftime('%Y-%m-%d')

        response = requests.get(
            f"{self.base_url}/everything",
            params={
                "q": query,
                "from": from_date,
                "sortBy": "publishedAt",
                "language": "en",
                "apiKey": self.api_key
            }
        )

        if response.status_code == 200:
            return response.json().get("articles", [])
        return []

# Usage
collector = NewsCollector(api_key="your_api_key")
articles = collector.get_articles("artificial intelligence")

RSS Feed Processing

import feedparser
from concurrent.futures import ThreadPoolExecutor

class RSSCollector:
    def __init__(self, feed_urls):
        self.feed_urls = feed_urls

    def parse_feed(self, url):
        """Parse a single RSS feed."""
        feed = feedparser.parse(url)
        articles = []

        for entry in feed.entries:
            articles.append({
                "title": entry.get("title", ""),
                "summary": entry.get("summary", ""),
                "link": entry.get("link", ""),
                "published": entry.get("published", ""),
                "source": feed.feed.get("title", url)
            })

        return articles

    def collect_all(self):
        """Collect from all feeds concurrently."""
        with ThreadPoolExecutor(max_workers=10) as executor:
            results = executor.map(self.parse_feed, self.feed_urls)

        return [article for articles in results for article in articles]

Text Preprocessing

Cleaning and Normalization

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

class TextPreprocessor:
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))

    def clean_text(self, text):
        """Clean and normalize text."""
        # Convert to lowercase
        text = text.lower()

        # Remove URLs
        text = re.sub(r'http\S+|www\S+|https\S+', '', text)

        # Remove HTML tags (non-greedy match inside angle brackets)
        text = re.sub(r'<.*?>', '', text)

        # Remove special characters (keep letters and spaces)
        text = re.sub(r'[^a-zA-Z\s]', '', text)

        # Remove extra whitespace
        text = ' '.join(text.split())

        return text

    def tokenize_and_lemmatize(self, text):
        """Tokenize and lemmatize text."""
        tokens = word_tokenize(text)

        # Remove stopwords and lemmatize
        tokens = [
            self.lemmatizer.lemmatize(token)
            for token in tokens
            if token not in self.stop_words and len(token) > 2
        ]

        return tokens

    def process(self, text):
        """Full preprocessing pipeline."""
        cleaned = self.clean_text(text)
        tokens = self.tokenize_and_lemmatize(cleaned)
        return ' '.join(tokens)
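To see what the cleaning stage does in isolation, here is a minimal, dependency-free sketch of the same regex steps (the NLTK tokenization and lemmatization are omitted):

```python
import re

def clean_text(text):
    """Lowercase, strip URLs and HTML tags, keep only letters and spaces."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # remove URLs
    text = re.sub(r'<.*?>', '', text)           # remove HTML tags (non-greedy)
    text = re.sub(r'[^a-z\s]', '', text)        # keep letters and spaces
    return ' '.join(text.split())               # collapse whitespace

print(clean_text("Visit <b>Our Site</b> at https://example.com NOW!!"))
# visit our site at now
```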

How Do You Choose the Right Sentiment Analysis Model?

Model selection depends on your domain: general-purpose transformers like DistilBERT work well for broad news analysis, while domain-specific models like FinBERT deliver significantly higher accuracy for financial text by understanding industry-specific language and context.

Using Pre-trained Transformers

from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

class TransformerSentiment:
    def __init__(self, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
        self.classifier = pipeline(
            "sentiment-analysis",
            model=model_name,
            device=0  # GPU; use device=-1 to run on CPU
        )

    def analyze(self, text, max_length=512):
        """Analyze sentiment of text."""
        # Truncate overly long text (character-based heuristic)
        if len(text) > max_length:
            text = text[:max_length]

        result = self.classifier(text)[0]

        return {
            "label": result["label"],
            "score": result["score"],
            "sentiment_value": 1 if result["label"] == "POSITIVE" else -1
        }

    def analyze_batch(self, texts, batch_size=32):
        """Analyze sentiment of multiple texts."""
        results = self.classifier(texts, batch_size=batch_size)
        return [
            {
                "label": r["label"],
                "score": r["score"],
                "sentiment_value": 1 if r["label"] == "POSITIVE" else -1
            }
            for r in results
        ]

Financial News Sentiment

For financial news, use specialized models. According to a study published in the Journal of Financial Data Science, FinBERT outperforms general-purpose sentiment models by 12-15 percentage points on financial text classification tasks, making it the standard for quantitative finance applications.

from transformers import BertTokenizer, BertForSequenceClassification
import torch

class FinancialSentiment:
    def __init__(self):
        self.tokenizer = BertTokenizer.from_pretrained("ProsusAI/finbert")
        self.model = BertForSequenceClassification.from_pretrained("ProsusAI/finbert")
        self.labels = ["positive", "negative", "neutral"]

    def analyze(self, text):
        """Analyze financial sentiment."""
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=512
        )

        with torch.no_grad():
            outputs = self.model(**inputs)

        probabilities = torch.softmax(outputs.logits, dim=1)
        predicted_class = torch.argmax(probabilities, dim=1).item()

        return {
            "label": self.labels[predicted_class],
            "probabilities": {
                label: prob.item()
                for label, prob in zip(self.labels, probabilities[0])
            }
        }
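The softmax step above is what turns FinBERT's raw logits into probabilities over the three labels; the underlying math is simple enough to sketch in plain Python (the logit values here are made up for illustration, not real model outputs):

```python
import math

def softmax(logits):
    """Map raw scores to probabilities that sum to 1 (numerically stable)."""
    m = max(logits)                              # subtract max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for ["positive", "negative", "neutral"]
probs = softmax([2.0, 0.5, 0.1])
print(probs)  # largest probability corresponds to the largest logit
```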

How Does Entity-Level Sentiment Analysis Work?

Entity-level sentiment analysis combines Named Entity Recognition (NER) with sentence-level sentiment scoring to attribute positive or negative sentiment to specific companies, people, or organizations mentioned within an article, rather than classifying the entire article as a whole.

Named Entity Recognition + Sentiment

import spacy
from collections import defaultdict

class EntitySentiment:
    def __init__(self, sentiment_analyzer):
        self.nlp = spacy.load("en_core_web_sm")
        self.sentiment_analyzer = sentiment_analyzer

    def extract_entity_sentiments(self, text):
        """Extract entities and their associated sentiments."""
        doc = self.nlp(text)
        entity_sentiments = defaultdict(list)

        # For each sentence containing an entity, analyze sentiment
        for sent in doc.sents:
            sent_text = sent.text
            sent_sentiment = self.sentiment_analyzer.analyze(sent_text)

            for ent in sent.ents:
                entity_sentiments[ent.text].append({
                    "sentence": sent_text,
                    "sentiment": sent_sentiment,
                    "entity_type": ent.label_
                })

        # Aggregate sentiments per entity
        results = {}
        for entity, mentions in entity_sentiments.items():
            avg_sentiment = sum(
                m["sentiment"]["sentiment_value"] * m["sentiment"]["score"]
                for m in mentions
            ) / len(mentions)

            results[entity] = {
                "mention_count": len(mentions),
                "average_sentiment": avg_sentiment,
                "entity_type": mentions[0]["entity_type"],
                "mentions": mentions
            }

        return results
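The per-entity average is a confidence-weighted mean: each mention contributes its direction (+1 or -1) scaled by the model's confidence score. A standalone sketch of that aggregation, using hypothetical mention data:

```python
def average_entity_sentiment(mentions):
    """mentions: list of (sentiment_value, score) pairs, value in {+1, -1}."""
    return sum(value * score for value, score in mentions) / len(mentions)

# Three hypothetical mentions of one entity: two positive, one negative
print(average_entity_sentiment([(1, 0.9), (-1, 0.6), (1, 0.8)]))
```

A confident negative mention thus pulls the average down more than a low-confidence one, which keeps borderline classifications from dominating the entity's score.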

We built a similar entity-level sentiment pipeline for our Sentiment Classification for News case study, where tracking sentiment toward specific companies and topics across thousands of articles was a core requirement.

Pipeline Integration

Complete Analysis Pipeline

import pandas as pd
from datetime import datetime

class NewsSentimentPipeline:
    def __init__(self, news_collector, preprocessor, sentiment_analyzer):
        self.collector = news_collector
        self.preprocessor = preprocessor
        self.analyzer = sentiment_analyzer

    def run(self, query, days_back=7):
        """Run the complete sentiment analysis pipeline."""
        # Collect articles
        articles = self.collector.get_articles(query, days_back)

        results = []
        for article in articles:
            # Combine title and description
            text = f"{article.get('title', '')} {article.get('description', '')}"

            # Preprocess
            processed_text = self.preprocessor.process(text)

            # Analyze sentiment
            sentiment = self.analyzer.analyze(processed_text)

            results.append({
                "title": article.get("title"),
                "source": article.get("source", {}).get("name"),
                "published_at": article.get("publishedAt"),
                "url": article.get("url"),
                "sentiment_label": sentiment["label"],
                "sentiment_score": sentiment["score"],
                "sentiment_value": sentiment["sentiment_value"]
            })

        return pd.DataFrame(results)

    def aggregate_by_date(self, df):
        """Aggregate sentiment by date."""
        df["date"] = pd.to_datetime(df["published_at"]).dt.date

        return df.groupby("date").agg({
            "sentiment_value": "mean",
            "sentiment_score": "mean",
            "title": "count"
        }).rename(columns={"title": "article_count"})
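Daily averages from aggregate_by_date can be noisy when article counts are low; a short rolling mean smooths the trend before plotting. A sketch with made-up daily values:

```python
import pandas as pd

# Hypothetical daily average sentiment values
daily = pd.Series(
    [0.2, -0.1, 0.4, -0.3, 0.1],
    index=pd.date_range("2026-01-01", periods=5, freq="D"),
)

# 3-day rolling mean; min_periods=1 keeps the first days instead of NaN
smoothed = daily.rolling(window=3, min_periods=1).mean()
print(smoothed)
```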

Visualization and Reporting

import matplotlib.pyplot as plt
import seaborn as sns

def plot_sentiment_trend(df, title="Sentiment Trend"):
    """Plot sentiment over time."""
    fig, ax1 = plt.subplots(figsize=(12, 6))

    # Sentiment line
    ax1.plot(df.index, df["sentiment_value"], color="blue", label="Sentiment")
    ax1.set_xlabel("Date")
    ax1.set_ylabel("Average Sentiment", color="blue")
    ax1.axhline(y=0, color="gray", linestyle="--", alpha=0.5)

    # Article count bars
    ax2 = ax1.twinx()
    ax2.bar(df.index, df["article_count"], alpha=0.3, color="green", label="Articles")
    ax2.set_ylabel("Article Count", color="green")

    plt.title(title)
    fig.tight_layout()
    plt.savefig("sentiment_trend.png")
    plt.show()
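The architecture diagram ends with alerting, which the plotting code does not cover. A minimal rule-based sketch (the window and threshold values here are illustrative, not recommendations) flags a query when recent average sentiment drops below a floor:

```python
def should_alert(daily_sentiment, window=3, threshold=-0.3):
    """Return True when the mean of the last `window` days falls below threshold."""
    if len(daily_sentiment) < window:
        return False  # not enough history to judge
    recent = daily_sentiment[-window:]
    return sum(recent) / window < threshold

# Last three days average about -0.37, below the -0.3 floor
print(should_alert([0.1, 0.2, -0.5, -0.4, -0.2]))  # True
```

In production this check would run on the output of aggregate_by_date and publish to a channel such as SNS or Slack.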

Source Analysis

def analyze_by_source(df):
    """Analyze sentiment by news source."""
    source_sentiment = df.groupby("source").agg({
        "sentiment_value": ["mean", "std"],
        "title": "count"
    }).round(3)

    source_sentiment.columns = ["avg_sentiment", "sentiment_std", "article_count"]
    source_sentiment = source_sentiment.sort_values("article_count", ascending=False)

    return source_sentiment

Production Deployment

AWS Lambda Function

import json
import boto3
from datetime import datetime

def lambda_handler(event, context):
    """AWS Lambda handler for sentiment analysis."""
    # Parse input
    body = json.loads(event.get("body", "{}"))
    query = body.get("query", "technology")
    days = body.get("days", 7)

    # Run pipeline
    pipeline = NewsSentimentPipeline(
        news_collector=NewsCollector(api_key=get_api_key()),
        preprocessor=TextPreprocessor(),
        sentiment_analyzer=TransformerSentiment()
    )

    results = pipeline.run(query, days)

    # Store results
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="sentiment-results",
        Key=f"results/{query}/{datetime.now().isoformat()}.json",
        Body=results.to_json()
    )

    return {
        "statusCode": 200,
        "body": json.dumps({
            "articles_analyzed": len(results),
            "average_sentiment": results["sentiment_value"].mean()
        })
    }

According to Hugging Face's 2024 State of AI report, transformer-based sentiment models have become the default choice for production NLP, with over 80% of new sentiment analysis deployments using pre-trained transformers rather than traditional rule-based or bag-of-words approaches.

How BeyondScale Can Help

At BeyondScale, we specialize in building production-grade NLP and sentiment analysis pipelines tailored to your industry and data. Whether you're monitoring brand reputation across news media, building financial sentiment signals for trading, or tracking public perception in real time, our team can help you go from prototype to production with confidence.

Explore our AI Development services | See our Sentiment Classification case study

Conclusion

Building a news sentiment analysis pipeline involves multiple components working together. The key steps are:

  • Data Collection: Gather news from multiple sources
  • Preprocessing: Clean and normalize text
  • Analysis: Apply appropriate sentiment models
  • Aggregation: Combine results meaningfully
  • Visualization: Present insights effectively
Consider model selection carefully: general-purpose models work well for broad analysis, while domain-specific models (like FinBERT) excel in specialized contexts. Regular evaluation and retraining ensure accuracy as language patterns evolve.
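Regular evaluation, in its simplest form, means comparing model predictions against a hand-labeled sample. A dependency-free sketch (the labels below are made up for illustration):

```python
def accuracy(gold, predicted):
    """Fraction of predictions that match the hand-labeled gold labels."""
    assert len(gold) == len(predicted)
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "negative", "positive", "positive"]
print(accuracy(gold, pred))  # 0.75
```

Tracking this number on a fixed holdout set over time is the simplest way to catch drift before it affects downstream dashboards.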

Frequently Asked Questions

How accurate is sentiment analysis for news articles?

General-purpose transformer models like DistilBERT achieve 85-90% accuracy on news sentiment classification. Domain-specific models such as FinBERT, trained on financial text, can reach 90-95% accuracy for their target domain. Accuracy improves further with fine-tuning on your specific data and regular evaluation.

Can NLP sentiment analysis be used for financial market predictions?

Yes, NLP-based sentiment analysis of financial news is widely used for market intelligence. Studies show that news sentiment correlates with short-term stock price movements, and many hedge funds and trading firms incorporate real-time news sentiment signals into their quantitative strategies using specialized models like FinBERT.

How do you perform real-time sentiment analysis on news feeds?

Real-time news sentiment analysis involves collecting articles via RSS feeds or news APIs, preprocessing and cleaning the text, running it through a pre-trained or fine-tuned transformer model, and storing results for aggregation. AWS Lambda or similar serverless functions enable scalable, event-driven processing.

Which model should I choose for news sentiment analysis?

For general news sentiment, DistilBERT fine-tuned on SST-2 is a fast and effective baseline. For financial news, FinBERT is the standard choice as it is trained specifically on financial text. For entity-level sentiment, combine a sentiment model with spaCy NER to attribute sentiment to specific companies or topics.


AI/ML Team at BeyondScale Technologies, an ISO 27001 certified AI consulting firm and AWS Partner. Specializing in enterprise AI agents, multi-agent systems, and cloud architecture.

Ready to Transform with AI Agents?

Schedule a consultation with our team to explore how AI agents can revolutionize your operations and drive measurable outcomes.