AI & Machine Learning

Sentiment Analysis of News Articles: Building an ML Pipeline

BeyondScale Team

AI/ML Team

January 7, 2026 · 9 min read

Sentiment analysis of news articles enables organizations to track public perception, monitor brand reputation, and identify emerging trends. This guide walks through building a production-ready sentiment analysis pipeline.

> Key Takeaways
>
> - Pre-trained transformer models like DistilBERT and FinBERT provide strong baselines for news sentiment with 85-95% accuracy
> - Entity-level sentiment analysis lets you attribute positive or negative sentiment to specific companies, people, or topics within an article
> - A complete pipeline includes data collection, preprocessing, model inference, aggregation, and visualization
> - Serverless deployment on AWS Lambda enables cost-effective, scalable real-time sentiment processing

What Are the Key Use Cases for News Sentiment Analysis?

News sentiment analysis uses NLP to classify articles as positive, negative, or neutral, enabling use cases from brand monitoring and market intelligence to crisis detection and competitive analysis.
  • Brand Monitoring: Track how your company is portrayed in media
  • Market Intelligence: Gauge market sentiment for investment decisions
  • Public Relations: Monitor crisis situations in real-time
  • Competitive Analysis: Understand perception of competitors
  • Trend Detection: Identify emerging topics and sentiments
According to Grand View Research, the global sentiment analytics market was valued at $4.4 billion in 2024 and is projected to grow at a CAGR of 14.2% through 2030, driven by increasing demand for real-time media monitoring and customer insight. Organizations across finance, healthcare, and technology increasingly rely on custom AI development to build sentiment pipelines tailored to their specific domains.

Pipeline Architecture

News Sources
    │
    ▼
┌─────────────────┐
│ Data Collection │
│  (RSS, APIs)    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Text Extraction │
│   & Cleaning    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Sentiment    │
│    Analysis     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Aggregation   │
│   & Storage     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Visualization  │
│   & Alerting    │
└─────────────────┘

Data Collection

News API Integration

import requests
from datetime import datetime, timedelta

class NewsCollector:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://newsapi.org/v2"

    def get_articles(self, query, days_back=7):
        """Fetch news articles for a query."""
        from_date = (datetime.now() - timedelta(days=days_back)).strftime('%Y-%m-%d')

        response = requests.get(
            f"{self.base_url}/everything",
            params={
                "q": query,
                "from": from_date,
                "sortBy": "publishedAt",
                "language": "en",
                "apiKey": self.api_key
            }
        )

        if response.status_code == 200:
            return response.json().get("articles", [])
        return []

# Usage
collector = NewsCollector(api_key="your_api_key")
articles = collector.get_articles("artificial intelligence")

RSS Feed Processing

import feedparser
from concurrent.futures import ThreadPoolExecutor

class RSSCollector:
    def __init__(self, feed_urls):
        self.feed_urls = feed_urls

    def parse_feed(self, url):
        """Parse a single RSS feed."""
        feed = feedparser.parse(url)
        articles = []

        for entry in feed.entries:
            articles.append({
                "title": entry.get("title", ""),
                "summary": entry.get("summary", ""),
                "link": entry.get("link", ""),
                "published": entry.get("published", ""),
                "source": feed.feed.get("title", url)
            })

        return articles

    def collect_all(self):
        """Collect from all feeds concurrently."""
        with ThreadPoolExecutor(max_workers=10) as executor:
            results = executor.map(self.parse_feed, self.feed_urls)

        return [article for articles in results for article in articles]

Text Preprocessing

Cleaning and Normalization

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

class TextPreprocessor:
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))

    def clean_text(self, text):
        """Clean and normalize text."""
        # Convert to lowercase
        text = text.lower()

        # Remove URLs
        text = re.sub(r'http\S+|www\S+|https\S+', '', text)

        # Remove HTML tags (non-greedy match inside angle brackets)
        text = re.sub(r'<.*?>', '', text)

        # Remove special characters (keep letters and spaces)
        text = re.sub(r'[^a-zA-Z\s]', '', text)

        # Remove extra whitespace
        text = ' '.join(text.split())

        return text

    def tokenize_and_lemmatize(self, text):
        """Tokenize and lemmatize text."""
        tokens = word_tokenize(text)

        # Remove stopwords and lemmatize
        tokens = [
            self.lemmatizer.lemmatize(token)
            for token in tokens
            if token not in self.stop_words and len(token) > 2
        ]

        return tokens

    def process(self, text):
        """Full preprocessing pipeline."""
        cleaned = self.clean_text(text)
        tokens = self.tokenize_and_lemmatize(cleaned)
        return ' '.join(tokens)
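To see what the cleaning stage does in isolation, here is a minimal, dependency-free sketch of the same regex steps (the NLTK tokenization and lemmatization are omitted):

```python
import re

def clean_text(text):
    """Lowercase, strip URLs and HTML tags, keep only letters and spaces."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # remove URLs
    text = re.sub(r'<.*?>', '', text)           # remove HTML tags (non-greedy)
    text = re.sub(r'[^a-z\s]', '', text)        # keep letters and spaces
    return ' '.join(text.split())               # collapse whitespace

print(clean_text("Visit <b>Our Site</b> at https://example.com NOW!!"))
# visit our site at now
```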

How Do You Choose the Right Sentiment Analysis Model?

Model selection depends on your domain: general-purpose transformers like DistilBERT work well for broad news analysis, while domain-specific models like FinBERT deliver significantly higher accuracy for financial text by understanding industry-specific language and context.

Using Pre-trained Transformers

from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

class TransformerSentiment:
    def __init__(self, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
        self.classifier = pipeline(
            "sentiment-analysis",
            model=model_name,
            device=0  # GPU; use device=-1 to run on CPU
        )

    def analyze(self, text, max_length=512):
        """Analyze sentiment of text."""
        # Truncate overly long text (character-based heuristic)
        if len(text) > max_length:
            text = text[:max_length]

        result = self.classifier(text)[0]

        return {
            "label": result["label"],
            "score": result["score"],
            "sentiment_value": 1 if result["label"] == "POSITIVE" else -1
        }

    def analyze_batch(self, texts, batch_size=32):
        """Analyze sentiment of multiple texts."""
        results = self.classifier(texts, batch_size=batch_size)
        return [
            {
                "label": r["label"],
                "score": r["score"],
                "sentiment_value": 1 if r["label"] == "POSITIVE" else -1
            }
            for r in results
        ]

Financial News Sentiment

For financial news, use specialized models. According to a study published in the Journal of Financial Data Science, FinBERT outperforms general-purpose sentiment models by 12-15 percentage points on financial text classification tasks, making it the standard for quantitative finance applications.

from transformers import BertTokenizer, BertForSequenceClassification
import torch

class FinancialSentiment:
    def __init__(self):
        self.tokenizer = BertTokenizer.from_pretrained("ProsusAI/finbert")
        self.model = BertForSequenceClassification.from_pretrained("ProsusAI/finbert")
        self.labels = ["positive", "negative", "neutral"]

    def analyze(self, text):
        """Analyze financial sentiment."""
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=512
        )

        with torch.no_grad():
            outputs = self.model(**inputs)

        probabilities = torch.softmax(outputs.logits, dim=1)
        predicted_class = torch.argmax(probabilities, dim=1).item()

        return {
            "label": self.labels[predicted_class],
            "probabilities": {
                label: prob.item()
                for label, prob in zip(self.labels, probabilities[0])
            }
        }
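The softmax step above is what turns FinBERT's raw logits into probabilities over the three labels; the underlying math is simple enough to sketch in plain Python (the logit values here are made up for illustration, not real model outputs):

```python
import math

def softmax(logits):
    """Map raw scores to probabilities that sum to 1 (numerically stable)."""
    m = max(logits)                              # subtract max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for ["positive", "negative", "neutral"]
probs = softmax([2.0, 0.5, 0.1])
print(probs)  # largest probability corresponds to the largest logit
```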

How Does Entity-Level Sentiment Analysis Work?

Entity-level sentiment analysis combines Named Entity Recognition (NER) with sentence-level sentiment scoring to attribute positive or negative sentiment to specific companies, people, or organizations mentioned within an article, rather than classifying the entire article as a whole.

Named Entity Recognition + Sentiment

import spacy
from collections import defaultdict

class EntitySentiment:
    def __init__(self, sentiment_analyzer):
        self.nlp = spacy.load("en_core_web_sm")
        self.sentiment_analyzer = sentiment_analyzer

    def extract_entity_sentiments(self, text):
        """Extract entities and their associated sentiments."""
        doc = self.nlp(text)
        entity_sentiments = defaultdict(list)

        # For each sentence containing an entity, analyze sentiment
        for sent in doc.sents:
            sent_text = sent.text
            sent_sentiment = self.sentiment_analyzer.analyze(sent_text)

            for ent in sent.ents:
                entity_sentiments[ent.text].append({
                    "sentence": sent_text,
                    "sentiment": sent_sentiment,
                    "entity_type": ent.label_
                })

        # Aggregate sentiments per entity
        results = {}
        for entity, mentions in entity_sentiments.items():
            avg_sentiment = sum(
                m["sentiment"]["sentiment_value"] * m["sentiment"]["score"]
                for m in mentions
            ) / len(mentions)

            results[entity] = {
                "mention_count": len(mentions),
                "average_sentiment": avg_sentiment,
                "entity_type": mentions[0]["entity_type"],
                "mentions": mentions
            }

        return results
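The per-entity average is a confidence-weighted mean: each mention contributes its direction (+1 or -1) scaled by the model's confidence score. A standalone sketch of that aggregation, using hypothetical mention data:

```python
def average_entity_sentiment(mentions):
    """mentions: list of (sentiment_value, score) pairs, value in {+1, -1}."""
    return sum(value * score for value, score in mentions) / len(mentions)

# Three hypothetical mentions of one entity: two positive, one negative
print(average_entity_sentiment([(1, 0.9), (-1, 0.6), (1, 0.8)]))
```

A confident negative mention thus pulls the average down more than a low-confidence one, which keeps borderline classifications from dominating the entity's score.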

We built a similar entity-level sentiment pipeline for our Sentiment Classification for News case study, where tracking sentiment toward specific companies and topics across thousands of articles was a core requirement.

Pipeline Integration

Complete Analysis Pipeline

import pandas as pd
from datetime import datetime

class NewsSentimentPipeline:
    def __init__(self, news_collector, preprocessor, sentiment_analyzer):
        self.collector = news_collector
        self.preprocessor = preprocessor
        self.analyzer = sentiment_analyzer

    def run(self, query, days_back=7):
        """Run the complete sentiment analysis pipeline."""
        # Collect articles
        articles = self.collector.get_articles(query, days_back)

        results = []
        for article in articles:
            # Combine title and description
            text = f"{article.get('title', '')} {article.get('description', '')}"

            # Preprocess
            processed_text = self.preprocessor.process(text)

            # Analyze sentiment
            sentiment = self.analyzer.analyze(processed_text)

            results.append({
                "title": article.get("title"),
                "source": article.get("source", {}).get("name"),
                "published_at": article.get("publishedAt"),
                "url": article.get("url"),
                "sentiment_label": sentiment["label"],
                "sentiment_score": sentiment["score"],
                "sentiment_value": sentiment["sentiment_value"]
            })

        return pd.DataFrame(results)

    def aggregate_by_date(self, df):
        """Aggregate sentiment by date."""
        df["date"] = pd.to_datetime(df["published_at"]).dt.date

        return df.groupby("date").agg({
            "sentiment_value": "mean",
            "sentiment_score": "mean",
            "title": "count"
        }).rename(columns={"title": "article_count"})
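Daily averages from aggregate_by_date can be noisy when article counts are low; a short rolling mean smooths the trend before plotting. A sketch with made-up daily values:

```python
import pandas as pd

# Hypothetical daily average sentiment values
daily = pd.Series(
    [0.2, -0.1, 0.4, -0.3, 0.1],
    index=pd.date_range("2026-01-01", periods=5, freq="D"),
)

# 3-day rolling mean; min_periods=1 keeps the first days instead of NaN
smoothed = daily.rolling(window=3, min_periods=1).mean()
print(smoothed)
```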

Visualization and Reporting

import matplotlib.pyplot as plt
import seaborn as sns

def plot_sentiment_trend(df, title="Sentiment Trend"):
    """Plot sentiment over time."""
    fig, ax1 = plt.subplots(figsize=(12, 6))

    # Sentiment line
    ax1.plot(df.index, df["sentiment_value"], color="blue", label="Sentiment")
    ax1.set_xlabel("Date")
    ax1.set_ylabel("Average Sentiment", color="blue")
    ax1.axhline(y=0, color="gray", linestyle="--", alpha=0.5)

    # Article count bars
    ax2 = ax1.twinx()
    ax2.bar(df.index, df["article_count"], alpha=0.3, color="green", label="Articles")
    ax2.set_ylabel("Article Count", color="green")

    plt.title(title)
    fig.tight_layout()
    plt.savefig("sentiment_trend.png")
    plt.show()
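The architecture diagram ends with alerting, which the plotting code does not cover. A minimal rule-based sketch (the window and threshold values here are illustrative, not recommendations) flags a query when recent average sentiment drops below a floor:

```python
def should_alert(daily_sentiment, window=3, threshold=-0.3):
    """Return True when the mean of the last `window` days falls below threshold."""
    if len(daily_sentiment) < window:
        return False  # not enough history to judge
    recent = daily_sentiment[-window:]
    return sum(recent) / window < threshold

# Last three days average about -0.37, below the -0.3 floor
print(should_alert([0.1, 0.2, -0.5, -0.4, -0.2]))  # True
```

In production this check would run on the output of aggregate_by_date and publish to a channel such as SNS or Slack.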

Source Analysis

def analyze_by_source(df):
    """Analyze sentiment by news source."""
    source_sentiment = df.groupby("source").agg({
        "sentiment_value": ["mean", "std"],
        "title": "count"
    }).round(3)

    source_sentiment.columns = ["avg_sentiment", "sentiment_std", "article_count"]
    source_sentiment = source_sentiment.sort_values("article_count", ascending=False)

    return source_sentiment

Production Deployment

AWS Lambda Function

import json
import boto3
from datetime import datetime

def lambda_handler(event, context):
    """AWS Lambda handler for sentiment analysis."""
    # Parse input
    body = json.loads(event.get("body", "{}"))
    query = body.get("query", "technology")
    days = body.get("days", 7)

    # Run pipeline
    pipeline = NewsSentimentPipeline(
        news_collector=NewsCollector(api_key=get_api_key()),
        preprocessor=TextPreprocessor(),
        sentiment_analyzer=TransformerSentiment()
    )

    results = pipeline.run(query, days)

    # Store results
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="sentiment-results",
        Key=f"results/{query}/{datetime.now().isoformat()}.json",
        Body=results.to_json()
    )

    return {
        "statusCode": 200,
        "body": json.dumps({
            "articles_analyzed": len(results),
            "average_sentiment": results["sentiment_value"].mean()
        })
    }

According to Hugging Face's 2024 State of AI report, transformer-based sentiment models have become the default choice for production NLP, with over 80% of new sentiment analysis deployments using pre-trained transformers rather than traditional rule-based or bag-of-words approaches.

How BeyondScale Can Help

At BeyondScale, we specialize in building production-grade NLP and sentiment analysis pipelines tailored to your industry and data. Whether you're monitoring brand reputation across news media, building financial sentiment signals for trading, or tracking public perception in real time, our team can help you go from prototype to production with confidence.

Explore our AI Development services | See our Sentiment Classification case study

Conclusion

Building a news sentiment analysis pipeline involves multiple components working together. The key steps are:

  • Data Collection: Gather news from multiple sources
  • Preprocessing: Clean and normalize text
  • Analysis: Apply appropriate sentiment models
  • Aggregation: Combine results meaningfully
  • Visualization: Present insights effectively
Consider model selection carefully: general-purpose models work well for broad analysis, while domain-specific models (like FinBERT) excel in specialized contexts. Regular evaluation and retraining ensure accuracy as language patterns evolve.
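Regular evaluation, in its simplest form, means comparing model predictions against a hand-labeled sample. A dependency-free sketch (the labels below are made up for illustration):

```python
def accuracy(gold, predicted):
    """Fraction of predictions that match the hand-labeled gold labels."""
    assert len(gold) == len(predicted)
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "negative", "positive", "positive"]
print(accuracy(gold, pred))  # 0.75
```

Tracking this number on a fixed holdout set over time is the simplest way to catch drift before it affects downstream dashboards.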

Frequently Asked Questions

How accurate is sentiment analysis for news articles?

General-purpose transformer models like DistilBERT achieve 85-90% accuracy on news sentiment classification. Domain-specific models such as FinBERT, trained on financial text, can reach 90-95% accuracy for their target domain. Accuracy improves further with fine-tuning on your specific data and regular evaluation.

Can NLP sentiment analysis be used for financial market predictions?

Yes, NLP-based sentiment analysis of financial news is widely used for market intelligence. Studies show that news sentiment correlates with short-term stock price movements, and many hedge funds and trading firms incorporate real-time news sentiment signals into their quantitative strategies using specialized models like FinBERT.

How do you perform real-time sentiment analysis on news feeds?

Real-time news sentiment analysis involves collecting articles via RSS feeds or news APIs, preprocessing and cleaning the text, running it through a pre-trained or fine-tuned transformer model, and storing results for aggregation. AWS Lambda or similar serverless functions enable scalable, event-driven processing.

Which model should I choose for news sentiment analysis?

For general news sentiment, DistilBERT fine-tuned on SST-2 is a fast and effective baseline. For financial news, FinBERT is the standard choice as it is trained specifically on financial text. For entity-level sentiment, combine a sentiment model with spaCy NER to attribute sentiment to specific companies or topics.


AI/ML Team at BeyondScale Technologies, an ISO 27001 certified AI consulting firm and AWS Partner. Specializing in enterprise AI agents, multi-agent systems, and cloud architecture.

Ready to Transform with AI Agents?

Schedule a consultation with our team to explore how AI agents can revolutionize your operations and drive measurable outcomes.