Predicting Product URLs Using TF‑IDF — A Scalable Web Crawler

2025-06-11

Cover image for Predicting Product URLs Using TF‑IDF — A Scalable Web Crawler

Web scraping is essential for e-commerce analysis, price monitoring, and market intelligence. But identifying product URLs across different domains is a non-trivial problem that resists one-size-fits-all rules.

This project uses TF‑IDF vectorization + a machine learning classifier to predict which URLs on a page are product pages—without relying on site-specific rules.

Here's a high-level diagram of the scaled implementation:

[Design diagram: high-level architecture of the scaled implementation]

Problem Statement

Traditional crawlers use handcrafted logic to extract product URLs from e-commerce sites. However:

  • Every site has a different layout.
  • Hard-coded patterns break often.
  • Rule-based systems don't scale well.

Goal: Build a domain-agnostic classifier that can predict if a given URL is a product page based on patterns in the URL text.

A key issue in scaling such an application is its ability to filter product URLs across hundreds of websites. We can start with something as simple as matching patterns in the URL, like /product/, /item/, or /p/. But this approach needs constant updating for every website that does not follow any of the given patterns.
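The naive pattern-matching baseline described above can be sketched in a few lines. The patterns and example URLs here are hypothetical, just to show where the approach works and where it silently fails:

```python
import re

# Naive rule-based baseline: flag URLs whose path contains a known
# product-page marker. It breaks on any site using a different scheme.
PRODUCT_PATTERNS = re.compile(r"/(product|item|p)/", re.IGNORECASE)

def is_product_url_rules(url: str) -> bool:
    return bool(PRODUCT_PATTERNS.search(url))

print(is_product_url_rules("https://shop.example.com/product/12345/shoes"))
# True — matches the /product/ pattern
print(is_product_url_rules("https://shop.example.com/blue-cotton-shirt-MP000123"))
# False — a real product page, but no known marker, so the rule misses it
```

The second URL is exactly the failure mode the post is about: a product page that a rule set has never seen.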

Instead of using a Rule-based approach, this project uses machine learning for better scalability.

Approach         | Pros     | Cons
Rule-based       | Fast     | Hard to maintain, low recall
Machine Learning | Scalable | Overhead of training the model and labelling

Feature extraction was done using TF-IDF, since splitting URLs into smaller tokens and weighting them by frequency lets the classifier generalize across sites. Data from 3 different e-commerce websites was labelled and vectorized to train an XGBoost model, which produced highly accurate results.


Why TF-IDF?

This method revolves around the frequency and importance of words.

In the case of URL classification, I noticed a common pattern: product URLs contain the product name (more like a description) in them, and these descriptions are more or less repetitive.

To support my argument, I analyzed 8 lakh+ (800,000+) URLs. Here's what I found:

[Figure: frequency distribution of words across the crawled URLs]

In the distribution graph above, one can observe the frequency of each word in the URLs. A handful of words are repeated very often; for example: shirt, cotton, white, blue.

Thus, relying on token frequency made sense.
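The frequency analysis behind that graph can be sketched with the standard library. The URL list below is a tiny stand-in for the 8 lakh+ crawled URLs, and the tokenizer (splitting on non-alphanumeric characters) is an assumption about how the post's analysis was done:

```python
import re
from collections import Counter

# Stand-in for the real crawled URL corpus.
urls = [
    "https://shop.example.com/white-cotton-shirt-p-101",
    "https://shop.example.com/blue-cotton-shirt-p-102",
    "https://shop.example.com/white-linen-shirt-p-103",
]

def tokenize(url: str) -> list[str]:
    # Split the URL on anything that isn't a letter or digit.
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

counts = Counter(t for url in urls for t in tokenize(url))
print(counts["shirt"], counts["cotton"], counts["white"])  # 3 2 2
```

Descriptive tokens like "shirt" dominate the tail of the distribution, which is what makes frequency a usable signal.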

Approach

  1. Crawl web pages and extract candidate URLs.
  2. Tokenize each URL.
  3. Convert URLs to TF‑IDF feature vectors.
  4. Train a classifier (this project uses XGBoost) on labeled URLs.
  5. Predict new product URLs with high precision/recall.
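The five steps above can be sketched end to end with scikit-learn. This is a minimal, self-contained version: the toy URLs and labels stand in for the real labelled dataset, and logistic regression stands in for the XGBoost model the post actually trained, just to keep the sketch dependency-light:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled data: 1 = product URL, 0 = non-product URL.
urls = [
    "https://shop.example.com/white-cotton-shirt-p-101",
    "https://shop.example.com/blue-denim-jeans-p-102",
    "https://shop.example.com/black-leather-shoes-p-103",
    "https://shop.example.com/c/men/clothing",
    "https://shop.example.com/help/returns-policy",
    "https://shop.example.com/account/login",
]
labels = [1, 1, 1, 0, 0, 0]

def tokenize(url: str) -> list[str]:
    # Step 2: split each URL into word-like tokens.
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

# Steps 3-4: TF-IDF vectorize the tokens, then fit a classifier.
model = make_pipeline(
    TfidfVectorizer(tokenizer=tokenize, token_pattern=None),
    LogisticRegression(),
)
model.fit(urls, labels)

# Step 5: predict on unseen URLs.
print(model.predict([
    "https://shop.example.com/red-wool-sweater-p-200",
    "https://shop.example.com/help/contact-us",
]))
```

Swapping `LogisticRegression()` for `xgboost.XGBClassifier()` in the pipeline recovers the setup the post describes.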

How TF‑IDF Helps

TF‑IDF stands for Term Frequency – Inverse Document Frequency. It highlights tokens (substrings or words) that are important in a URL:

  • TF – Frequency of a token in one URL.
  • IDF – Rarity of the token across all URLs.

For example, in /product/12345/shoes, the token "product" appears in many product URLs but not in category or blog URLs. A token that is common across the whole corpus actually gets a lower IDF weight; what matters is that its weight differs systematically between product and non-product URLs, and that difference is exactly the signal the classifier learns.

This lets the model learn meaningful patterns in URL structures without depending on hardcoded keywords.

Result

The model was trained on a labelled dataset of 16,000+ URLs. It was then tested against all the URLs scraped from TataCliq, and the results were excellent:

Scraped URLs from TataCliq: 8,11,745

False Positives: 4

False negatives: 0

Total URLs collected from the above three domains: 8,28,292

The collected URLs were run through the model trained earlier to predict product URLs.

Total product URLs: 8,24,761

The result is stored in product_urls.json.zip

It took around 15 minutes to extract the product URLs.

Directory Structure

web_crawler/
│
├── crawler/
│   ├── main.py          # Entry point for the crawler
│   ├── extract.py       # Extracts URLs and content
│   ├── model.py         # TF-IDF + ML model logic
│   └── utils.py         # Miscellaneous helpers
│
├── data/                # Sample URLs and labels
├── examples/            # Use cases and test cases
├── url_model.joblib     # Trained model
└── README.md

Summary Table

Challenge             | Approach                        | Benefit
Non-scalable scraping | Predict with trained classifier | Adapts to new websites
Site-specific logic   | Use TF-IDF + classifier         | Robust to layout changes
Labeling difficulty   | Start small, expand iteratively | Easy to improve accuracy over time
Prediction speed      | Vectorize and classify quickly  | Efficient at crawl time

Future Improvements

  • Use URL depth and query parameters as features.
  • Add support for Naive Bayes, Random Forest, or BERT embeddings.
  • Use active learning to request user labels on uncertain predictions.
  • Visualize predictions with confidence scores.
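As a sketch of the first improvement above, URL depth and query parameters can be turned into numeric features and concatenated with the TF-IDF vector. The function name and feature choices here are hypothetical, not part of the current implementation:

```python
from urllib.parse import urlparse

def structural_features(url: str) -> list[float]:
    # Depth = number of non-empty path segments; plus a flag for
    # whether the URL carries query parameters.
    parsed = urlparse(url)
    depth = len([seg for seg in parsed.path.split("/") if seg])
    has_query = 1.0 if parsed.query else 0.0
    return [float(depth), has_query]

print(structural_features("https://shop.example.com/product/12345/shoes?color=blue"))
# [3.0, 1.0]
```

These two numbers could be appended to each URL's TF-IDF vector (e.g. via scikit-learn's `FeatureUnion`) before training.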

Resources