Predicting Product URLs Using TF‑IDF — A Scalable Web Crawler
2025-06-11
Web scraping is essential for e-commerce analysis, price monitoring, and market intelligence. But identifying product URLs across different domains is a non-trivial, non-scalable problem.
This project uses TF‑IDF vectorization + a machine learning classifier to predict which URLs on a page are product pages—without relying on site-specific rules.
Here's a high-level diagram of the scaled implementation:
Problem Statement
Traditional crawlers use handcrafted logic to extract product URLs from e-commerce sites. However:
- Every site has a different layout.
- Hard-coded patterns break often.
- Rule-based systems don't scale well.
Goal: Build a domain-agnostic classifier that can predict if a given URL is a product page based on patterns in the URL text.
The key challenge in scaling such an application is filtering product URLs across hundreds of different websites.
We could start with something simple: matching patterns in the URL, such as `/product/`, `/item/`, or `/p/`.
But this approach needs constant updates, because every new website may follow none of the known patterns.
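To make the brittleness concrete, here is a minimal sketch of such a rule-based matcher (the patterns and URLs are hypothetical): it catches only URLs containing one of a fixed set of path segments and silently misses everything else.

```python
import re

# Naive rule-based matcher: a fixed set of known path segments.
PRODUCT_PATTERN = re.compile(r"/(product|item|p)/")

def is_product_url(url: str) -> bool:
    """Return True if the URL contains a known product path segment."""
    return bool(PRODUCT_PATTERN.search(url))

print(is_product_url("https://shop.example.com/product/12345"))        # True
print(is_product_url("https://shop.example.com/buy/blue-cotton-shirt"))  # False: unseen pattern, missed
```

Every site that uses a different convention (here, `/buy/`) forces another entry in the pattern list, which is exactly the maintenance burden the ML approach avoids.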
Instead of using a Rule-based approach, this project uses machine learning for better scalability.
| Approach | Pros | Cons |
|---|---|---|
| Rule-based | Fast | Hard to maintain, low recall |
| Machine learning | Scalable | Overhead of training and labelling |
Feature extraction was done with TF-IDF: splitting URLs into smaller tokens and weighting them by frequency lets the classifier generalize across sites.
Data from three different e-commerce websites was labelled, vectorized, and used to train an XGBoost model, which produced highly accurate results.
Why TF-IDF?
This method revolves around the frequency and importance of words.
In the case of URL classification, I noticed a common pattern: almost every product URL contains the product name (more like a short description), and these descriptions are more or less repetitive.
To support this, I analyzed over 800,000 URLs. Here's what I found:

In the above distribution graph, one can observe the frequency of each word in the URLs. A handful of words repeat very often, for example *shirt*, *cotton*, *white*, and *blue*.
Thus, relying on token frequency made sense.
Approach
- Crawl web pages and extract candidate URLs.
- Tokenize each URL.
- Convert URLs to TF‑IDF feature vectors.
- Train a classifier (e.g., logistic regression) using labeled URLs.
- Predict new product URLs with high precision/recall.
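The steps above can be sketched end to end with scikit-learn. This is a minimal stand-in, not the repo's actual training script: the URLs and labels are hypothetical, and logistic regression is used here (as the steps suggest) rather than the XGBoost model the project trained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled data (hypothetical URLs): 1 = product page, 0 = not.
urls = [
    "https://shop.example.com/product/blue-cotton-shirt/12345",
    "https://shop.example.com/p/white-shirt-slim-fit/67890",
    "https://shop.example.com/blog/summer-style-guide",
    "https://shop.example.com/category/mens-shirts",
]
labels = [1, 1, 0, 0]

# token_pattern splits URLs on non-alphanumeric characters, so path
# segments like "product" and "shirt" become individual tokens.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"[a-zA-Z0-9]+"),
    LogisticRegression(),
)
model.fit(urls, labels)

# Predict on a new, unseen URL.
print(model.predict(["https://shop.example.com/product/red-shirt/42"]))
```

The pipeline object bundles tokenization, vectorization, and classification, so at crawl time a raw URL string goes straight in and a label comes out.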
How TF‑IDF Helps
TF‑IDF stands for Term Frequency – Inverse Document Frequency. It highlights tokens (substrings or words) that are important in a URL:
- TF – Frequency of a token in one URL.
- IDF – Rarity of the token across all URLs.
For example, in `/product/12345/shoes`, the token "product" appears across most product URLs but rarely in category or blog URLs, so it carries a distinctive weight that the classifier can exploit.
This lets the model learn meaningful patterns in URL structures without depending on hardcoded keywords.
Result
The model was trained on a labelled dataset of 16,000+ URLs and then tested on all the URLs crawled from TataCliq, with excellent results:
- Scraped URLs from TataCliq: 811,745
- False positives: 4
- False negatives: 0
In total, 828,292 URLs were collected from the three domains.
Running them through the trained model yielded 824,761 predicted product URLs.
The results are stored in `product_urls.json.zip`.
The whole prediction run took around 15 minutes.
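Batch prediction at this scale is cheap because the trained pipeline is persisted once and then applied to crawled URLs in bulk. Here is a sketch of that save/load/filter loop with `joblib`; the training data and URLs are hypothetical stand-ins, not the repo's real dataset.

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Train a tiny stand-in pipeline and persist it, analogous to the
# repo's url_model.joblib artifact.
urls = [
    "https://shop.example.com/product/blue-shirt/1",
    "https://shop.example.com/p/white-shirt/2",
    "https://shop.example.com/blog/style-guide",
    "https://shop.example.com/help/returns",
]
labels = [1, 1, 0, 0]

pipeline = make_pipeline(
    TfidfVectorizer(token_pattern=r"[a-zA-Z0-9]+"),
    LogisticRegression(),
)
pipeline.fit(urls, labels)
joblib.dump(pipeline, "url_model.joblib")

# At crawl time: load once, classify candidates in bulk, keep positives.
model = joblib.load("url_model.joblib")
candidates = ["https://shop.example.com/product/red-shirt/3",
              "https://shop.example.com/help/contact"]
preds = model.predict(candidates)
product_urls = [u for u, label in zip(candidates, preds) if label == 1]
print(product_urls)
```

Since vectorization and classification are both linear-time per URL, hundreds of thousands of URLs can be scored in minutes on a single machine.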
Directory Structure
```
web_crawler/
│
├── crawler/
│   ├── main.py        # Entry point for the crawler
│   ├── extract.py     # Extracts URLs and content
│   ├── model.py       # TF-IDF + ML model logic
│   └── utils.py       # Miscellaneous helpers
│
├── data/              # Sample URLs and labels
├── examples/          # Use cases and test cases
├── url_model.joblib   # Trained model
└── README.md
```
Summary Table
| Challenge | Approach | Benefit |
|---|---|---|
| Non-scalable scraping | Predict with trained classifier | Adapts to new websites |
| Site-specific logic | Use TF‑IDF + classifier | Robust to layout changes |
| Labeling difficulty | Start small, expand iteratively | Easy to improve accuracy over time |
| Prediction speed | Vectorize and classify quickly | Efficient at crawl time |
Future Improvements
- Use URL depth and query parameters as features.
- Add support for Naive Bayes, Random Forest, or BERT embeddings.
- Use active learning to request user labels on uncertain predictions.
- Visualize predictions with confidence scores.