Choosing a web page to markdown extractor

I used ChatGPT Deep Research to find an HTML-to-Markdown tool and it suggested a really good one I’d not come across before: trafilatura.

Tags: markdown, parsing, tools

Published: April 12, 2025

Personal takeaways

  • Trafilatura looks perfect for my requirements.
  • A Python library or command line tool.
  • Open Source, actively maintained, 4.1k stars.
  • Widely used, including by HuggingFace and Microsoft Research.
  • The result of a PhD project, tested against other similar tools for accuracy, and a winner of various benchmark tests.
  • Extracts metadata like author, date, categories and tags.
  • It even has built-in functionality to work with RSS feeds and sitemaps.
  • Smart and polite crawling.

One of the things I want to do is take all the bookmarks I’ve saved over the years (recently using Raindrop, and prior to 2022 using Pinboard) and classify them, to make a map of my diverse interests. To do that properly I want to extract the content of those pages in Markdown format. So what is the best method to do that?

I’ve found that a really good way to find the right tool in this kind of situation is to use ChatGPT Deep Research. I’m pleased I did here, because I’d researched this myself some months ago and hadn’t come across the best solution.

Here’s the prompt I used in this case:

I want to extract the text from URLs and get it in markdown format, along with metadata about the page such as title and (where possible) date of publication. The markdown should be clean, without all the navigation elements or advertising or whatever. The method could be a command-line in Python or nodejs (for example) or an application for OSX. The solution should ideally be open source, but a paid application would be acceptable if it is really good and not expensive.

These are a few I am considering: 
https://github.com/kepano/defuddle
https://newspaper.readthedocs.io/en/latest/ 
https://github.com/buriy/python-readability 
https://github.com/goose3/goose3 
https://github.com/microsoft/markitdown 

If the solution is open source preferably it should be a popular, active project with a lot of stars.

And this is the result:


Comparison of Web Content Extraction Tools for Markdown Conversion

Extracting the main content of web pages and converting it into clean Markdown (with minimal clutter) can be achieved with several open-source libraries and a few affordable services. Below we compare top tools on key criteria: output format (clean Markdown), metadata extraction (title, publication date, etc.), batch processing support, scriptability (CLI or library), ability to handle sites that block bots or require a headless browser, and the tool’s maintenance and popularity. Both open-source solutions (preferably active and popular) and a couple of affordable commercial options are covered.

Feature Comparison Overview

| Tool | Platform | Markdown Output | Metadata Extracted | Batch/CLI Support | Handling Hard Sites | Maintenance & Popularity |
|---|---|---|---|---|---|---|
| Defuddle (open) | Node.js (TypeScript) | Yes (direct option) | Title, author, date, etc. (schema.org) 1 | Yes (CLI & library) | No JS execution (needs pre-fetch) | Active (2023+, new project) |
| Mercury Parser (open) | Node.js | Yes (option for MD) 2 | Title, author, date, excerpt 3 | Yes (CLI & library) | No JS (requires static HTML) | Popular (~5k★); stable but few recent updates |
| Readability (python) (open) | Python (library) | Indirect (HTML/text output) | Title + main content (no structured date/author) 4 | Yes (library scriptable) | No JS (requires static HTML) | Moderate (~2.7k★); updated to latest readability.js 5 |
| Newspaper3k/4k (open) | Python | No (text-only output) | Title, authors, publish date, top image (if found) | Yes (library; can loop) | No JS (HTTP fetch, set user-agent) | Very popular (~12k★); original unmaintained since 2020 6 (fork active in 2024) |
| Goose3 (open) | Python | No (text-only output) | Title, meta description, meta tags, main image 7 | Yes (library; context manager) | No JS (HTTP fetch, UA configurable) | Active (last commit 2025); moderate usage (~1k★) |
| Trafilatura (open) | Python | Yes (direct to MD) 8 | Title, author, date, site name, tags 9 | Yes (CLI & library) | No JS (polite crawling; static HTML) | Active (regular updates, ~4k★) 10; highly accurate |
| MarkItDown (open) | Python | Yes (converts HTML to MD) | Not specialized (converts all content; no explicit article metadata) | Yes (CLI & library) | No JS (converts provided content) | Extremely popular (~25k★ in 2024) 11; new MS project |
| Instaparser API (commercial) | Web API | No direct MD (returns HTML or plain text) | Title, body text, images, videos 12 (via Article API) | Yes (API supports bulk calls) | Uses Instapaper parser (optimized for mobile; can accept behind-paywall HTML) 13 | Commercial ($99/mo for 100k calls; free 1,000-call trial) 14 |
| Apify Article Extractor (commercial) | Cloud API/Platform | Yes (returns Markdown) 15 | Title, description, author, date, content 16 | Yes (API & batch actor) | Headless browser (runs JS, bypasses bot blocks) | Commercial (pay-as-you-go; free tier available) |

Legend: “open” = open-source; ★ = GitHub stars (approximate).

Below, we detail each tool with its features, pros, cons, and ideal use cases.

Open-Source Tools

Defuddle (Node.js) – Modern Content Cleaner

Defuddle is a new open-source Node.js library (TypeScript) designed to extract the primary content from web pages while removing clutter like sidebars, headers, footers, and comments 17 18. It outputs a cleaned HTML or Markdown string and captures extensive metadata. Defuddle was originally developed for the Obsidian Web Clipper, aiming to provide cleaner input for Markdown conversion 19. It can serve as a drop-in replacement for Mozilla’s Readability with some enhancements: it’s more forgiving (removing fewer elements if unsure), preserves structured elements like footnotes, handles code blocks and math consistently, uses a page’s mobile layout to identify junk elements, and extracts rich metadata (including schema.org info) 20. Defuddle returns an object with fields such as content (clean HTML or MD), title, author, published date, domain, site name, description, main image, etc 21. A companion CLI tool (defuddle-cli) is available for command-line use.

Pros:
- Clean Markdown output: Built-in option to convert content to Markdown (markdown: true) for easy export 22.
- Rich metadata extraction: Captures title, author, publication date, description, site name, and more (even parsing schema.org microdata) 23 24.
- Active development: Recent project with modern JS implementation; integration with Obsidian tools indicates active maintenance and practical use.
- Node.js scriptability: Can be used as an NPM library or via CLI, and it leverages JSDOM for parsing, making it straightforward to integrate in Node workflows 25 26.

Cons:
- Work in progress: The author cautions that Defuddle is still evolving 27, so there may be occasional bugs or changes. It’s newer and less battle-tested than older parsers.
- No built-in browser automation: Defuddle itself doesn’t execute JavaScript on pages. It relies on static HTML (JSDOM). For sites that require JS (e.g. to load content or bypass anti-bot measures), you must fetch the rendered HTML separately (e.g. using a headless browser or an API) before feeding it to Defuddle.
- Smaller user base: Being new, it has fewer users and GitHub stars than long-established projects, though it’s gaining traction among Obsidian and web clipper users.

Use case: Defuddle is great for a Node.js environment where you want clean Markdown and metadata directly. If you’re clipping articles or building a Markdown knowledge base (for example, with Obsidian), Defuddle provides a convenient all-in-one solution. Just be mindful to handle dynamic pages separately and expect rapid iteration as the project matures.

Mercury Parser (Postlight) – Battle-Tested Article Parser

Mercury Parser (by Postlight) is a well-known open-source JavaScript tool (available via NPM as @postlight/parser) for extracting meaningful content from web pages 28. It takes a URL (or HTML) and returns a structured JSON containing the article’s title, author, date published, lead image URL, excerpt (dek), and the cleaned content 29. Mercury’s parsing focuses on what humans care about – the main text and important metadata – while stripping ads and navigation. It also allows output in HTML, plain text, or Markdown: for example, setting contentType: 'markdown' returns the content as GitHub-flavored Markdown 30. Mercury was the engine behind the Mercury Reader browser extension and the now-defunct Mercury Web Parser API. It supports custom domain-specific parsing rules for tricky sites, which developers can add via CSS selectors 31.

Pros:
- Accurate content extraction: Mercury’s algorithm (inspired by the Arc90 Readability heuristics) is tuned for a wide variety of sites and was used in production by many apps, yielding reliable results for articles and blog posts. It captures title, author, and published date when available 32.
- Markdown output support: It can directly return content in Markdown format, preserving basic formatting like bold, links, lists, and headings 33. This saves an extra conversion step.
- Customizable parsing rules: Developers can extend or override its parser on a per-domain basis using simple CSS selectors 34, allowing refinement for sites where the default heuristics fall short.
- Popular and well-documented: With ~5.5k stars on GitHub and a lot of community usage, Mercury is well-documented and there are examples for common scenarios. It’s a proven solution for content extraction tasks.

Cons:
- No JavaScript execution: Mercury fetches and parses static HTML. It does not run a headless browser, so it will not retrieve content that requires client-side rendering or login. If a page’s content is hidden behind scripts or load-more buttons, Mercury alone cannot extract it.
- Maintenance status: Postlight’s Mercury Parser is open source but no longer actively developed by the original team (the public API was shut down in 2020 35). The library remains usable and stable, but any new web technology or changes (e.g., new kinds of HTML structures) might not be specifically addressed unless community contributors step in.
- JavaScript only: It’s designed for Node (though it can also run in the browser). There is no official Python version. (Python users would use readability or other libraries instead.)

Use case: Mercury is a solid choice if you are working in Node.js and want a stable, high-quality parser for article content. It’s especially useful when you need Markdown output with minimal fuss. If you encounter a site it doesn’t parse well, you can often tweak the rules. Just remember to handle sites requiring dynamic content loading with an external solution (e.g., fetch the page with Puppeteer then pass the HTML to Mercury’s Parser.parse()).

python-readability (Readability/LXML) – Firefox’s Reader Mode Algorithm in Python

python-readability (also known as readability-lxml) is a fast Python port of the Arc90 Readability algorithm (as refined by Mozilla) 36. Essentially, it brings the logic behind Firefox’s “Reader View” to Python. Given an HTML document, it identifies the main article content by scoring paragraphs and removes boilerplate (headers, footers, sidebars). The output is typically a chunk of HTML for the content, which you can then convert to text or Markdown. It also provides a cleaned title. This library is lightweight and easy to use (installable via pip install readability-lxml). It’s more of a low-level component – it focuses on content extraction and doesn’t provide a ton of extras beyond that.

Pros:
- Proven algorithm: Implements the Readability.js approach that’s widely used in browsers for content extraction. This means it’s reasonably accurate for typical articles, as it’s based on years of tuning for Reader Mode.
- Simple and fast: The library is straightforward – you feed it HTML and get back the main content (e.g., via Document(html).summary() to get HTML of the article). It doesn’t pull in heavy dependencies beyond lxml. It’s quite fast and can process many pages quickly, suitable for batch jobs.
- Python integration: Works within a Python script easily. You can combine it with other Python tools (e.g. use Requests to fetch pages, feed HTML to readability-lxml, then perhaps use a Markdown converter). It’s scriptable for batches (just loop through URLs).

Cons:
- Limited metadata: Out of the box, it mainly gives you the content and maybe the article’s title (it often uses the <title> tag or an <h1> if appropriate). It does not explicitly extract publication date or author by itself. You would have to parse those separately (e.g., from meta tags or bylines) if needed.
- No direct Markdown output: The output is cleaned HTML. You will need an extra step/library to convert that HTML to Markdown if you want to preserve formatting (for example, using a converter like html2text or markdownify). This is doable but not as convenient as tools that output Markdown directly.
- No JS or advanced fetching: Like most Python scrapers, it will not handle dynamic content without help. Also, if a site’s raw HTML is heavily malformed, Readability might not parse it well unless lxml can clean it.
- Moderate maintenance: The library does get updates to track Mozilla’s Readability improvements, but it’s not a large active project with a big community. It has a modest user base. For most cases the algorithm is stable, but if something breaks (e.g., due to a new HTML pattern), fixes might rely on a small number of contributors.

Use case: Use python-readability when you want to implement the classic Reader Mode extraction in a Python pipeline and you don’t need extensive metadata. It’s a good choice if you plan to handle metadata extraction and Markdown conversion yourself or you just need plain text. For example, if you have 3000 news article URLs and you want to quickly get the text of each for NLP analysis, readability-lxml could do the job efficiently. If you require more structure or metadata, you might combine it with other tools or opt for a more feature-rich library.
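
For illustration, here is a minimal sketch of that pipeline, assuming the requests, readability-lxml, and markdownify packages (markdownify is just one of several possible HTML-to-Markdown converters; the URL is a placeholder):

```python
import requests
from markdownify import markdownify as to_markdown
from readability import Document

url = "https://example.com/some-article"  # placeholder URL
html = requests.get(url, timeout=30).text

doc = Document(html)
title = doc.title()           # cleaned page title
article_html = doc.summary()  # main content as an HTML fragment

# readability-lxml stops at HTML, so the Markdown conversion is a separate step
article_md = to_markdown(article_html, heading_style="ATX")
print(f"# {title}\n\n{article_md}")
```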

Newspaper3k (and Newspaper4k) – News-Focused Extractor with NLP Features

Newspaper3k is a popular Python library for article extraction, especially from news sites. It not only pulls the main text, but also attempts to extract the title, authors, publish date, top image, videos, keywords, and a summary of the article 37. It was designed as an all-in-one solution for news content mining, including some basic NLP (natural language processing) tasks like keyword extraction and summarization. Newspaper is easy to use: you can feed it a URL and it will download the page (with Python’s requests), parse the article content, and give you fields like article.text (cleaned text), article.title, article.authors, article.publish_date, etc. However, the original Newspaper3k project has seen little maintenance after 2020. A community fork named Newspaper4k has emerged in 2023–2024 to update the library (e.g., adding support for more languages and fixing bugs) 38 39.

Pros:
- All-in-one pipeline: Newspaper handles fetching the URL, parsing HTML, extracting content and metadata, and even doing things like downloading images. It’s convenient – you give it a list of URLs or use its Article or Source classes, and it manages the rest.
- Metadata and extras: It automatically tries to get authors (by searching for author tags or names in the page), publication date (from meta tags or HTML), and top image (via Open Graph tags or heuristics). These features mean you often get more than just text.
- Text processing features: After extraction, Newspaper can do simple NLP: e.g., article.keywords and article.summary use TextRank or similar algorithms to summarize the content and list keywords. This can be handy if you need quick insights from the text.
- Batch support: You can feed a list of URLs to Newspaper by creating a Source (or simply looping). It’s designed to handle multiple articles (e.g., crawling a news website section). It’s scriptable and also has a multi-threading option for faster processing of many URLs.
- Popularity and community knowledge: Newspaper has been widely used in the Python community (commonly cited in blog posts about web scraping news). There are many Q&A and examples available, which can help troubleshoot issues.

Cons:
- Plain text output: Newspaper focuses on plain text extraction – article.text gives you the content without HTML formatting. While it’s clean (no ads or nav), it doesn’t preserve bold, lists, or heading hierarchy, and converting the output to Markdown afterwards cannot recover that structure. If preserving formatting is important, this is a limitation.
- Maintenance gaps: The original library hasn’t been updated since 2020 40, and users reported issues with some sites and TLS/SSL problems. The Newspaper4k fork is addressing some issues (latest release in 2024), but it’s essentially a community fix. There may still be unresolved bugs or incompatibilities. Use of Newspaper might require applying community fixes or using the fork for better results.
- Dynamic content and blocking: Newspaper’s HTTP fetching is not headless – sites that require JavaScript to display content or that block non-browser user agents can cause Newspaper’s .download() to fail or retrieve incomplete HTML (e.g., Cloudflare challenges result in an empty page). Users often had to set a custom User-Agent string or session cookies to work around this 41. In some cases, manual intervention or using an external fetch is needed for those pages.
- Performance and dependencies: Newspaper can be somewhat heavy – it uses lxml, and for NLP features it can use NLTK corpora (for stopwords). Initializing it might download some NLP resources. Also, parsing a very large number of articles can be slow or memory-intensive, though for ~3000 it should be fine with proper settings.

Use case: Newspaper3k is useful if you want a ready-to-use Python solution that not only extracts the main text but also gives you metadata and even a summary. It’s particularly suited for news articles or blog posts. For example, if you are building a news aggregator or doing data mining on news, Newspaper can fetch and parse each article conveniently. If you need Markdown or exact formatting, or if many target pages have paywalls or heavy JS, you might prefer a different approach. Also, consider using the updated fork (Newspaper4k) for better results in 2025.
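
As an illustration, a minimal sketch with Newspaper3k (the Newspaper4k fork exposes the same Article API); the URL and user-agent string are placeholders:

```python
from newspaper import Article, Config

config = Config()
config.browser_user_agent = "Mozilla/5.0"  # a browser-like UA helps on some sites
config.request_timeout = 30

article = Article("https://example.com/some-news-story", config=config)  # placeholder URL
article.download()  # plain HTTP fetch; no JavaScript execution
article.parse()

print(article.title)
print(article.authors)
print(article.publish_date)
print(article.text[:500])  # cleaned plain text (formatting is not preserved)

# Optional NLP extras (require NLTK data to be downloaded first):
# article.nlp()
# print(article.keywords, article.summary)
```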

Goose3 – Refined Goose Parser in Python

Goose3 is a Python 3 fork and continuation of the Goose extractor (originally in Java, then Scala, then Python). It is designed to pull out the main textual content of an article and related metadata. Goose3 will attempt to extract the article’s text, title, meta description, meta keywords, the main image, and even YouTube/Vimeo embeds from a page 42. It’s somewhat similar in goals to Newspaper, but with a different lineage. Goose’s algorithm is based on hitting the “content sweet spot” – it looks for the HTML element that contains the bulk of text (many <p> tags) and assumes that’s the article body, then cleans it. Goose3 is actively maintained, with recent improvements to author extraction and other tweaks (commits as of late 2024) 43. It doesn’t include NLP extras, focusing mainly on the content and media extraction.

Pros:
- Focused on content: Goose3’s primary aim is to correctly identify and return the main article text. It often works well on news sites and blogs, identifying the right container element for the story.
- Basic metadata: It will capture things like the meta description, meta keywords, and the first large image in the content (often the lead image of an article) 44. Some recent versions also attempt author extraction from common patterns (though this might not be as thorough as dedicated approaches).
- Pythonic usage: It’s easy to use: e.g., from goose3 import Goose; article = Goose().extract(url=...). You get an article object with properties like article.cleaned_text (main text), article.title, article.meta_description, article.top_image, etc. This is convenient for storing results or further processing in Python.
- Batch processing: Goose3 can be used in a loop or with a context manager for efficiency (to reuse the parser for multiple URLs) 45. It’s designed to handle lots of articles; you can adjust settings (like timeouts, user agent) via a configuration object for robustness 46 47.
- Active maintenance: Unlike some older libraries, Goose3 has received updates (bug fixes, Python 3 support improvements, etc.) up to 2025. This means better compatibility with current websites and Python versions.

Cons:
- No Markdown or rich text out-of-box: Goose3’s cleaned_text is plain text without HTML tags. It preserves paragraphs (usually separated by newlines), but formatting like lists or bold text will be lost. There isn’t a built-in Markdown output. You’d need to access article.clean_html (if available) or the DOM and convert manually if you wanted Markdown with structure.
- Metadata not as rich as some: Goose3’s metadata extraction covers basics (description, keywords, image). It doesn’t automatically give you publish date or author in a structured way, except if those were in standard meta tags (and you might have to parse them from the generic metadata). So, for title and text it’s great, but for things like date you might need to manually parse the page’s meta or schema (or use another library in conjunction).
- Occasional mis-identification: No algorithm is perfect – Goose can sometimes pick the wrong element if the page has unusual structure (e.g., long sidebar content). It may include some junk if the heuristics mistake it as part of the article. Tuning configuration (like content length thresholds) or post-cleaning might be needed in some edge cases.
- Not JS-aware: Like Newspaper, Goose3 just does an HTTP fetch (with the option to set a browser-like User-Agent string 48). It won’t navigate or run scripts. If a site shows a teaser and requires clicking “read more” (with JS) to load the rest, Goose will only see the teaser in the HTML. Similarly, paywalled content that isn’t in the initial HTML won’t be extracted.

Use case: Goose3 is a good choice for Python users who need a lightweight, actively-maintained article extractor focusing on text content and basic meta. If you have a diverse set of sites and want a reliable heuristic approach, Goose3 is solid. It might slightly trail Newspaper in terms of built-in features, but it makes up for that with current maintenance. Use it when you need to process many pages in Python and prefer an open-source library with an established, simple API.
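
A minimal sketch of that loop with Goose3; the URLs are placeholders and the configuration keys shown are only a subset of what Goose3 accepts:

```python
from goose3 import Goose

# Configuration can be passed as a dict; keys shown here are illustrative
config = {"browser_user_agent": "Mozilla/5.0", "http_timeout": 30}

urls = ["https://example.com/post-1", "https://example.com/post-2"]  # placeholders

with Goose(config) as g:  # context manager reuses the parser across URLs
    for url in urls:
        article = g.extract(url=url)
        print(article.title)
        print(article.meta_description)
        print(article.cleaned_text[:300])  # plain text; list/heading structure is lost
        if article.top_image:
            print(article.top_image.src)
```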

Trafilatura – Robust All-in-One Text & Markdown Extractor

Trafilatura is a Python package and CLI tool that represents the state-of-the-art in open-source content extraction. It is specifically designed to turn raw HTML into clean, structured data, and it includes many features out of the box 49 50. Trafilatura can download webpages (with polite crawling policies), parse them, extract the main text, and also grab metadata like title, author, publication date, site name, and even comments if needed 51. It preserves text structure (paragraphs, headings, lists, etc.) and can output directly to multiple formats – including plain text, JSON, XML, and Markdown 52. Under the hood, it uses a mix of content extraction algorithms (it mentions using jusText, Readability, and custom heuristics) to maximize accuracy 53. Notably, Trafilatura has been evaluated in research and benchmarks and has consistently outperformed other libraries in precision/recall of content extraction 54 55. It’s actively maintained and widely used by academic and industry projects (HuggingFace, IBM, Microsoft Research, etc.) 56.

Pros:
- Excellent extraction accuracy: Trafilatura’s multi-algorithm approach and continuous tuning give it an edge in capturing all of the main content while excluding boilerplate. Benchmarks (ScrapingHub’s and academic studies) have rated it as one of the best (if not the best) open-source content extraction tools 57. This means for a wide variety of sites, you’re likely to get the full article text without extras.
- Direct Markdown output: Trafilatura can produce the content in Markdown directly, preserving structure such as headings, lists, emphasis, blockquotes, code blocks, etc., where applicable. This meets the “clean Markdown” criterion readily – the output is ready to use in Markdown format (or you can get it as text/HTML too, if needed) 58.
- Rich metadata and optional content: It automatically extracts common metadata (title, author, date, site name) and even tags/categories if present 59. All this information can be returned in a JSON along with the content. You can also enable extraction of comments or tables if you want them, or keep them off by default. This flexibility is great for different use cases.
- Batch and CLI friendly: Trafilatura includes a command-line interface that can take a list of URLs or files and process them in batch. It also supports sitemaps and RSS feeds for crawling multiple pages 60. In code, it’s as simple as trafilatura.fetch_url(url) to download a page and trafilatura.extract(html) to get the content. It’s designed with large-scale usage in mind (including throttling and parallelization options).
- Actively maintained and documented: The project sees regular updates and has comprehensive documentation and even an evaluation framework. Being used by big organizations lends confidence to its stability and longevity. It supports many languages and has been improved for international text extraction.

Cons:
- Performance overhead: Because Trafilatura packs a lot of functionality, it can be heavier than simpler tools. It might be slower on a per-page basis than a minimal Readability port, especially if you enable features like comments or run it with default (safe) settings that do extra checks. However, it’s still reasonably fast and can be tuned for speed.
- Complexity: With many features comes a bit more complexity. The number of options and outputs (XML, JSON, Markdown, etc.) means a learning curve to fully utilize it. If you just need basic text, you might be using only a fraction of its capabilities.
- No built-in headless JS: Trafilatura uses requests and its own parsing; it doesn’t render JavaScript. So like the other open-source tools, it will not magically get content that isn’t present in the initial HTML. You may still need to couple it with a solution for dynamic sites (fetching those pages via Selenium/Playwright and then feeding the HTML to Trafilatura for extraction).
- Python-only: It’s a Python tool. There is no direct Node.js equivalent. (However, the CLI could be invoked from other environments if absolutely needed, or one could run it as a service.)

Use case: Trafilatura is ideal when you need the best accuracy and completeness in content extraction and you want a ready-to-go solution in Python. For example, if you plan to regularly process thousands of articles from varied sources and want to minimize the chance of missing content or including boilerplate, Trafilatura is a top pick. It’s also great if you want Markdown output with minimal effort. Academics building corpora or NLP datasets, or developers feeding data into machine learning models, often use Trafilatura for its quality and metadata inclusivity 61. It might be overkill for a small one-off script, but it shines in ongoing and large-scale projects.
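
A minimal sketch of that workflow; the URL is a placeholder, and Markdown output plus the with_metadata flag assume a reasonably recent Trafilatura release (check the documentation for the exact options your version supports):

```python
import trafilatura

url = "https://example.com/some-article"  # placeholder URL

downloaded = trafilatura.fetch_url(url)  # polite HTTP fetch; returns None on failure
if downloaded:
    # Main content as Markdown, with metadata (title, author, date, ...) included
    markdown = trafilatura.extract(
        downloaded,
        output_format="markdown",
        with_metadata=True,
        include_links=True,
    )
    # Metadata can also be pulled out separately as a structured object
    meta = trafilatura.extract_metadata(downloaded)
    print(meta.title, meta.author, meta.date)
    print(markdown)
```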

MarkItDown – Universal Document to Markdown Converter (Microsoft)

MarkItDown is an open-source Python tool released by Microsoft in 2024 that converts a variety of file formats into Markdown. Unlike the other tools here, MarkItDown is not focused solely on HTML or web articles – it’s a general converter for PDFs, Word documents, PowerPoint slides, Excel sheets, images (via OCR), audio (via speech-to-text), and HTML/JSON/XML, etc. 62 63. The goal is to facilitate feeding all sorts of documents into Large Language Models or text analysis pipelines by normalizing them to Markdown 64. For HTML content, MarkItDown will parse the HTML and output an equivalent Markdown representation, preserving the structure (headings, lists, links, tables). It does not perform article-centric cleaning – it converts everything given to it. In essence, if you feed it a whole webpage HTML, you’ll get a Markdown version of the entire page (navigation, ads, and all) unless you pre-clean it. MarkItDown has become extremely popular (over 25k stars on GitHub in just a few weeks) due to its broad utility 65.

Pros:
- Broad format support: It can handle many input formats beyond just web HTML – if your workflow involves PDFs or Word docs as well as web pages, MarkItDown gives a unified solution to convert all to Markdown. This versatility is unique among the tools compared here.
- Preserves document structure: When converting HTML, it keeps the semantic structure as much as possible (headings become #, lists become - or 1., bold/italic become ** or *, etc.). It aims for a faithful Markdown rendition of the source. This is great if you want to maintain formatting (like tables or lists from the source).
- CLI and library available: You can use MarkItDown via command-line for single files or in a Python script (MarkItDown().convert(file_or_path)). It’s straightforward to integrate.
- Actively maintained: As a new project backed by Microsoft’s open-source team, it’s under active development. It has an engaged community, and issues/feature requests are being addressed rapidly (as of 2025).
- Use with LLM workflows: If your end goal is to supply content to language models or analysis tools, MarkItDown’s output is optimized for that (it’s “token-efficient” and structured 66 67). It also has an optional server mode for providing context to LLMs.

Cons:
- No content extraction: MarkItDown will not filter out nav bars, ads, or boilerplate. It’s not an article extractor – it’s a converter. If you give it an HTML of a blog post that still contains the header, sidebar, related links, etc., all those will faithfully appear in the Markdown output as well (likely as unneeded text or links). In other words, it doesn’t know what the “main content” is; it just converts everything. This is a major distinction from the other tools and means MarkItDown alone doesn’t meet the “clean content without extraneous elements” criterion unless you combine it with another tool.
- Limited metadata handling for HTML: MarkItDown doesn’t specifically pull out title or author from HTML in a structured way. If the HTML includes them in the text or metadata, they might just end up in the Markdown as part of the content. There’s no specialized logic to label or output those separately for web pages. (Its focus was on document structure, not metadata extraction.)
- Potentially low fidelity for complex layouts: While it preserves basic structure, very complex web layouts might not translate neatly to Markdown. MarkItDown tries to preserve content meaningfully, but for web pages with interactive elements or scripts, it’ll omit those or include fallback text. It may also not differentiate primary content vs. boilerplate (as noted). So the Markdown could still require manual cleaning for readability.
- Python environment needed: It runs on Python, which is fine for most, but if your environment is Node.js only, you’d have to call it separately or use an alternative.

Use case: MarkItDown is excellent if you have a variety of document types to process and you want them in Markdown (for example, feeding an LLM with a mixture of HTML articles, PDFs, and Word files). In the context of web content extraction, MarkItDown would play the role of a post-processor – you’d first extract the main article HTML using another tool (like Defuddle, Mercury, or Trafilatura), then use MarkItDown to convert that HTML to Markdown. This two-step approach could yield very clean results. If you try to use MarkItDown directly on raw URLs, it won’t handle the HTTP fetching or cleaning – you’d need to fetch HTML yourself and possibly strip boilerplate. In summary, use MarkItDown in combination with other tools (not standalone) when you need high-fidelity Markdown conversion across different file types.
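
As a rough sketch of that two-step idea, assuming the article HTML has already been isolated by one of the extractors above (the HTML snippet here is a stand-in); MarkItDown’s convert() works on file paths, so the cleaned HTML is written to a temporary file first:

```python
import tempfile

from markitdown import MarkItDown

# Stand-in for article HTML that has already been cleaned by another extractor;
# MarkItDown itself performs no boilerplate removal.
article_html = "<h1>Example</h1><p>Some <strong>article</strong> text.</p>"

with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as f:
    f.write(article_html)
    path = f.name

result = MarkItDown().convert(path)
print(result.text_content)  # Markdown rendition of the cleaned HTML
```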

Notable Commercial Options

While open-source tools are powerful, some scenarios (like heavy paywalls or sites aggressively blocking scrapers) might call for a commercial service. Here we include two affordable, high-quality options that are Mac-friendly (API-based, platform agnostic):

Instaparser (Instapaper API) – Instapaper’s Parser as a Service

Instaparser is an API provided by the read-it-later service Instapaper, which allows developers to leverage Instapaper’s article parsing engine 68. Essentially, it’s the same parser Instapaper uses to clean up web pages for reading on mobile devices, now offered as a paid API. Instaparser actually offers three endpoints 69:
- Article API: returns a comprehensive parse with all body text, images, and videos – basically the full content in a cleaned HTML/JSON form.
- Text API: returns just the plain text of the page’s main content (no HTML, good for NLP).
- Document API: allows you to submit raw HTML content that might not be publicly reachable (e.g., behind a login or paywall) and get it parsed.

The API outputs JSON with fields like title, content, etc. It’s not explicitly stated to return Markdown, but it returns either HTML content or plain text. You could convert HTML to Markdown if needed. Pricing is designed for scalability: a free trial (~1000 calls/month) and then plans starting at $99/month for 100k calls (about $0.001 per article) 70.

Pros:
- High quality parsing: Instapaper has a reputation for excellent content parsing (similar to Mercury/Readability quality). It handles a wide range of sites since it’s tuned for what a human reader would want. It will strip most ads and navigation, yielding just the article (plus images or videos in the content).
- Handles paywalled content (with help): The Document API is a unique feature – if you can fetch the HTML of a paywalled page (for example, after logging in or via some other means), you can send that HTML to Instaparser and get the cleaned content back 71. This means the parser itself doesn’t need to circumvent the paywall; you do, but it will parse whatever you give it. It’s a useful workaround for protected content that other tools can’t handle by themselves.
- Metadata and media: The Article API returns not just text but also content structure including images and embedded videos. If you need the main image or want to preserve the fact that an image or video was in the article, this API includes that information. Many open-source tools drop those or only give you a URL for the top image.
- Scalable and scriptable: Being an HTTP API, you can integrate it into scripts easily (Python requests or Node fetch). You can process thousands of URLs – the rate limit (depending on plan) might allow a few per second, which is fine for batch jobs. The JSON output is structured and easy to parse.
- Instapaper’s ongoing support: Instapaper (now independent) likely keeps this parser up to date as the web evolves (they need it for their own app). So you benefit from continuous improvements without doing anything.

Cons:
- Cost for volume: While $99/month for 100k articles is reasonable per article, it’s still a recurring cost. If you only have ~3000 initial URLs and then, say, a few hundred a month, you might not hit the quota but you’re still paying for the plan. There’s no cheaper plan beyond the free trial. So if your usage is modest (a few thousand pages total), you might not justify the cost vs. using an open-source solution you run yourself.
- No direct Markdown: Instaparser doesn’t return Markdown directly. You get either HTML content or plain text. You’d have to convert HTML to Markdown if that’s the desired final format. This is an extra step (though fairly straightforward with a library).
- External dependency: Relying on an API means you need internet access and have to consider API latency. Each call is going over the network, which can be slower than local parsing, especially if processing thousands of URLs (though you can parallelize to some extent). Also, if Instapaper were to change their API or the service becomes unavailable, that could disrupt your workflow.
- Privacy considerations: You are sending URLs (or HTML content) to a third-party (Instapaper) to parse. For sensitive content, this might be a concern. With open-source tools, everything stays local.

Use case: Instaparser is great for a plug-and-play, high-accuracy solution when you don’t mind using a paid API. If you’re building an app on macOS (or any platform) that needs to parse articles on-demand (like a reader app, or a research tool) and you want reliable results without maintaining your own parser, Instaparser is a strong option. It especially shines if you already have to handle user-specific content (where you might fetch pages with credentials and then need them cleaned – the Document API can take that from there). For one-off large batches, the cost might be a drawback, but for ongoing use where quality is paramount and you want to save development time, Instaparser delivers a proven solution.

Apify Article Extractor – Headless Browser Parsing in the Cloud

Apify is a web scraping/cloud automation platform, and they provide an Article Extractor actor (script) that you can use via their API or by running on Apify’s cloud. This solution uses a headless Chromium browser to load the page just as a normal user would, then applies content extraction logic to pull out the article text and metadata. Under the hood, Apify’s Article Extractor uses the open-source unfluff library for parsing content 72, combined with Apify’s web scraping capabilities (like proxy management, browser automation). The output includes the article’s title, author, publication date, content in both HTML and plain text, a short description (excerpt), and more 73. Notably, it can return a Markdown version of the content for convenience 74. Apify offers a free tier (with a certain number of computing seconds per month) and then pay-as-you-go pricing, which for a few thousand articles is likely only a few dollars. Being cloud-based, you can run it from any environment (it’s essentially platform agnostic; you just make API calls or use their SDK).

Pros:
- Handles dynamic and blocked sites: Because it uses a real headless browser, this tool can handle pages that require JavaScript to render or that aggressively block bots. The headless browser fetch will run any client-side scripts, bypass cookie consent walls, load content that appears on scroll, etc. This means even “difficult” sites (SPA frameworks, sites showing content after a delay, or those with anti-scraping measures) can be successfully scraped. In combination with Apify’s proxy options, it can also bypass simple IP or geo-blocks.
- Clean content with Markdown: The Article Extractor cleans the page content and can directly return it in Markdown format. According to Apify, it “supports rich formatting using Markdown” and cleans the HTML 75. This gives you the main text ready to use, much like the open-source tools do, but with the advantage of having been executed in a browser environment.
- Full metadata: The output JSON includes title, author, date, content in HTML and text, and an automatically generated short description (usually the first few lines or the meta description) 76. This is useful to have all in one response. It’s comparable to what Mercury or Trafilatura would give, for example.
- Ease of use for batch: You can send a single API request with a list of URLs to process (the Apify actor can take multiple URLs in one run). Apify’s platform will then browser-fetch each and parse them. This is convenient for processing e.g. 3000 URLs in one go – you don’t have to orchestrate threading or headless browsers yourself. They also provide client libraries (in Node, Python, etc.) to interact with the API.
- No maintenance burden: Using Apify’s hosted solution means you don’t have to worry about updating parsing rules or managing infrastructure. If unfluff were to misparse something, Apify might update their actor. And the headless Chrome is managed on their end, so you skip dealing with installing Chrome or dealing with browser crashes in your own environment.

Cons:
- Cost and rate limits: While likely affordable for a few thousand articles, running a headless browser for each page is heavier than a simple HTTP fetch, so it costs more in compute time. Apify charges by computing time and memory usage. For example, if each page takes a couple of seconds to render and parse, 3000 pages might be on the order of a few hours of compute. This could still be under the free tier or just a few dollars, but it’s something to monitor. Also, throughput is limited by how many browser instances can run in parallel on your plan – it may take some time to process thousands of URLs sequentially (though you can run multiple in parallel with higher plan or by splitting the job).
- External service dependency: As with any API, you rely on Apify’s service availability. However, Apify is a well-established platform and downtime is rare. You also need to obtain an API key and possibly manage usage through their dashboard, which is an extra step compared to purely local solutions.
- Data leaving your system: You’ll be sending the URLs (and the content loaded in a browser, which technically means Apify’s servers are accessing that content on your behalf). If the pages are public, this is usually fine. For any sensitive or login-required content, you would either not use this or use Apify’s more advanced features to log in (which is possible, but beyond the simple use-case and would incur more setup).
- Slightly slower per page: Because a real browser is used, it’s inherently slower than just fetching HTML and parsing. If speed is crucial and the pages are simple, this overhead might not be worth it. But for modern web pages, the headless approach sacrifices some speed for completeness.

Use case: Use Apify’s Article Extractor when you need a robust, hassle-free way to handle any website, including those that defeat normal scrapers. For instance, if a significant portion of your 3000 URLs are from sites that require login or heavy JS (say, some are behind paywalls or use infinite scroll), an open-source library would struggle, whereas Apify can get them. It’s also a good choice if you don’t want to write scraping code – you can just call this API and get JSON back. If you’re on a Mac and don’t want to deal with installing headless Chrome, offloading to Apify is convenient. In terms of cost, it’s pay-per-use, so you can estimate and control it (and there is a free tier to experiment).
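
For orientation only, a rough sketch using Apify’s official Python client (apify-client); the actor ID, input field names, and output field names below are placeholders, since the real values come from the Article Extractor actor’s page and input schema on Apify:

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")  # placeholder token

# Actor ID and input fields are illustrative placeholders; consult the actual
# actor's documentation for its ID and input schema.
run = client.actor("someuser/article-extractor").call(
    run_input={"startUrls": [{"url": "https://example.com/some-article"}]}
)

# Results land in the run's default dataset, one JSON record per article
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("title"), item.get("date"))
    print(item.get("markdown") or item.get("text"))
```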

(Aside: Other web scraping APIs like ScrapingBee, Zyte, etc., can also render JS and then you could apply an extraction algorithm. Apify’s solution is highlighted here because it bundles the extraction logic for you.)

Recommendations and Scenario-Based Picks

Finally, here are recommendations tailored to different scenarios, summarizing which tool(s) might be the best fit:

Scenario 1: Open-Source Only Solution

If you prefer to stick with open-source tools and run everything locally (to avoid costs and external dependencies), Trafilatura is arguably the top choice for accurate, clean content extraction with Markdown output. It provides the full package (content + metadata + Markdown) and is actively maintained, which is ideal for ongoing use. For a Python-based workflow, Trafilatura would likely give the highest success rate across a variety of sites 77.

In a Node.js environment, a combination of Defuddle and a conversion tool could be used – for example, Defuddle to get cleaned HTML (and metadata) 78, then maybe MarkItDown or Turndown to ensure well-formatted Markdown (if not using Defuddle’s built-in Markdown option). Mercury Parser is also a reliable open-source choice in Node.js if you want a one-step conversion to Markdown 79 and can accept its slightly dated maintenance status; it still performs well for many article sites and is simpler to set up than orchestrating multiple tools.

Recommendation: For open-source only, Trafilatura (Python) for an all-in-one solution is recommended for best quality and features. If constrained to Node.js, Mercury Parser (with potential fallbacks or rule tweaks) or Defuddle (for a more modern take) are strong options for Markdown output and metadata. You may need to supplement these with a headless browser fetch in the rare cases of highly dynamic sites.

Scenario 2: Best Accuracy and Coverage of Content

If your priority is to extract content as completely and accurately as possible (minimizing cases of missing text or including boilerplate), you should leverage the strongest parser, possibly even multiple for cross-checking. Trafilatura has demonstrated the best overall accuracy in benchmarks 80, so it stands out among open-source tools. It tries to balance precision and recall effectively, which means you get nearly all the relevant text with minimal noise.

On the commercial side, Instaparser is likely on par with these top open-source algorithms given Instapaper’s long focus on readability. It’s a paid service, but it will handle a wide range of content types effectively (especially if you need the images and such). However, since Instaparser doesn’t inherently handle dynamic sites beyond what’s in HTML, for maximum coverage you might need a headless solution.

For absolute completeness (including those pesky edge cases of dynamic content), using a headless browser approach is the fail-safe. This could mean employing Apify’s Article Extractor or running your own headless browser + Mercury/Readability. For example, you could automate Chrome/Playwright to open each page, wait for it to load, and then run Mozilla’s Readability script in-page to get content. That approach is more involved but ensures that even content loaded via JS is captured before extraction. Apify essentially provides this as a service.

Recommendation: For best accuracy with minimal manual effort, Trafilatura is the top open-source pick. You could augment it by running a headless fetch for any URLs that Trafilatura alone fails on (then feeding the HTML to Trafilatura or another parser). If budget allows and convenience is key, use Instaparser for its reliable parsing and consider Apify (or a custom headless pipeline) for any content that Instaparser can’t normally see. In practice, a combination could be: try Trafilatura or Mercury first, and if an article’s content comes back unusually short or blank (indicating failure due to dynamic content or blockage), fall back to a headless approach for that URL. This hybrid strategy will maximize content coverage.

Scenario 3: Handling Paywalls and Bot-Blocked Sites

When a significant number of target URLs are behind paywalls or have anti-scraping measures, the approach needs to be more specialized. No out-of-the-box open-source tool will bypass a hard paywall (e.g., one that requires login or where content isn’t in the HTML). However, some paywalls are “soft” – they show the content in the HTML but overlay a banner or use CSS/JS to hide after a few paragraphs. For such cases, a parser like Mercury or Trafilatura often still extracts the full text because it ignores styling and scripts (thus can capture the hidden text). So first, determine if the paywalled content is actually present in the HTML (just concealed). If yes, a normal extractor may work. If not, you need to simulate access.

Instaparser’s Document API can be useful if you can programmatically obtain the HTML (for example, maybe you have subscriptions and can fetch pages with session cookies or an API). You’d fetch the page with proper credentials and then send that HTML to Instaparser to parse 81. This handles the parsing nicely, but you still have to handle authentication or circumventing the paywall manually.

For sites that aggressively block bots (via anti-scraping services or requiring JS), using a headless browser with rotation is effective. Apify is a straightforward way to do this; you might configure it to use residential proxies if needed and then run the Article Extractor. Another approach is to use an API like ScrapingBee or Zyte which can render JS – you could fetch the page through them and then run an extraction locally on the result. Some of these services even offer an “extraction” mode, though often not as content-specific as Mercury/Readability (they might require you to specify selectors).

If you want to stay open-source, you can use Playwright or Selenium on your Mac to automate a browser. For example, login to a paywalled site once (saving cookies), then for each URL, have the headless browser load it with those cookies, execute a script to pull document.body.innerHTML, and feed that into Trafilatura or Readability. This is the most labor-intensive route but gives you full control.
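
A minimal sketch of that route, assuming Playwright for Python is installed and its browsers have been downloaded (playwright install); cookie handling for logged-in sites is omitted:

```python
import trafilatura
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    """Load a page in headless Chromium and return the rendered HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle", timeout=60_000)
        html = page.content()
        browser.close()
    return html

url = "https://example.com/js-heavy-article"  # placeholder URL
html = fetch_rendered_html(url)
print(trafilatura.extract(html, output_format="markdown", with_metadata=True))
```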

Recommendation: For paywalled and protected sites, leveraging headless browsers is key. If you prefer an easier route, use Apify’s Article Extractor or a similar scraping API that handles headless browsing and anti-bot measures; these will retrieve the content which you can then parse (Apify does parsing for you; others you might parse yourself). If you need to stick to open-source, consider using Playwright with Readability for those specific sites. Maintain a list of “difficult” domains and use a specialized fetching method for them. Once the content is fetched, you can still funnel it through your primary extraction tool for consistency. In summary, for paywalls: Apify is a top recommendation (because it automates the heavy lifting), and for DIY: a combination of Playwright + Trafilatura (fetch via Playwright, then extract with Trafilatura’s extract() on the rendered HTML) would cover almost all scenarios.

Other Considerations

  • Environment (Python vs Node.js): If your project is in Python, lean towards Python libraries (Trafilatura, Newspaper4k, Goose3, etc.) to minimize integration effort. If it’s Node.js/JavaScript, Node-oriented tools (Defuddle, Mercury, or using a JS DOM with Readability) will fit more naturally. Both ecosystems have viable solutions, so it often comes down to where you want to write your batch script or pipeline.
  • Batch Processing: All mentioned solutions can handle batch processing one way or another, but for sheer ease of batch, Trafilatura’s CLI can read a list of URLs directly, and Apify can take a list of URLs in one API call. If you prefer not to code at all, Apify’s ready-made actor via a web interface or API might be appealing. Otherwise, a simple Python script with a loop and Trafilatura/Mercury will do the job in a controllable manner.
  • Maintaining Formatting vs. Pure Text: If preserving the article’s formatting (headings, lists, emphasis) in Markdown is important, prioritize tools that output Markdown or HTML. Newspaper and Goose3, for example, lose that structure by default. Trafilatura, Mercury (with MD output), Defuddle (with MD option) will retain structure. This can affect readability of the output Markdown.
  • Metadata needs: All tools can get the title reliably. For publication date and author: Defuddle, Trafilatura, and the commercial tools explicitly try to get those (via schema.org or common patterns). Mercury and Readability-based tools often do not extract date/author on their own. Newspaper and Goose try to get author (with varying success) and sometimes date. If those fields are critical, you might either choose a tool that supports them or supplement by manually parsing <meta name="author" ...> or <meta property="article:published_time" ...> from the page HTML.
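
A minimal sketch of that meta-tag fallback with BeautifulSoup; the tag names shown are the common conventions, and individual sites will vary:

```python
from bs4 import BeautifulSoup

def extract_author_and_date(html: str) -> tuple[str | None, str | None]:
    """Pull author and publication date from common meta tags, if present."""
    soup = BeautifulSoup(html, "html.parser")

    author_tag = soup.find("meta", attrs={"name": "author"})
    date_tag = soup.find("meta", attrs={"property": "article:published_time"})

    author = author_tag.get("content") if author_tag else None
    date = date_tag.get("content") if date_tag else None
    return author, date
```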

Final Recommendation:

Considering all factors, a hybrid approach can often yield the best outcome: use a fast, open-source parser for the bulk of straightforward pages, and fall back to a headless solution or API for the tough cases. For instance, you could start with Trafilatura (open-source, high accuracy) for the ~3000 URLs. You’ll get structured Markdown and metadata for most. Then identify any URLs where Trafilatura failed due to dynamic content (maybe by checking if output is empty or too short) and run those through Apify or a custom Playwright script. This way, you minimize cost and maximize speed, but still handle everything. If you prefer not to manage that complexity, and the volume is not huge, you could directly use Instaparser for all pages – it will reliably get the content and you’d just convert it to Markdown, at the cost of an API dependency and fee.
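
A minimal sketch of that hybrid check, reusing the fetch_rendered_html helper from the paywall section above; the 500-character threshold is an arbitrary illustration rather than a recommendation:

```python
import trafilatura

def extract_with_fallback(url: str, min_chars: int = 500) -> str | None:
    """Try a plain fetch + Trafilatura first; fall back to a headless fetch if the result looks too short."""
    downloaded = trafilatura.fetch_url(url)
    markdown = trafilatura.extract(downloaded, output_format="markdown") if downloaded else None

    if markdown and len(markdown) >= min_chars:
        return markdown

    # Probably dynamic content or a blocked fetch: retry with a headless browser
    html = fetch_rendered_html(url)  # defined in the Playwright sketch above
    return trafilatura.extract(html, output_format="markdown")
```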

In summary, for a one-time or initial batch of 3000 pages, open-source tools like Trafilatura or Mercury/Defuddle + a bit of manual effort will likely suffice. For ongoing use with diverse sites, prepare to incorporate a headless browsing step (or use Apify) for the exceptional cases. This combined strategy ensures you meet all your criteria: clean Markdown output, metadata captured, batch processing, scriptability, and robustness against tricky sites. 82 83

Footnotes

  1. github.com GitHub - kepano/defuddle: Extract the main content from web pages.↩︎

  2. github.com GitHub - postlight/parser: Extract meaningful content from the chaos of a web page↩︎

  3. github.com GitHub - postlight/parser: Extract meaningful content from the chaos of a web page↩︎

  4. github.com buriy/python-readability: fast python port of arc90’s … - GitHub↩︎

  5. github.com buriy/python-readability: fast python port of arc90’s … - GitHub↩︎

  6. pypi.org newspaper4k - PyPI↩︎

  7. github.com goose3/README.rst at master · goose3/goose3 · GitHub↩︎

  8. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  9. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  10. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  11. dev.to Deep Dive into Microsoft MarkItDown - DEV Community↩︎

  12. appadvice.com Instapaper launches a new Instaparser API for developers↩︎

  13. appadvice.com Instapaper launches a new Instaparser API for developers↩︎

  14. instaparser.com Our Pricing Plans - Instaparser↩︎

  15. apify.com Article Text Extractor API in JavaScript - Apify↩︎

  16. apify.com Article Content Extractor · Apify↩︎

  17. github.com GitHub - kepano/defuddle: Extract the main content from web pages.↩︎

  18. github.com GitHub - kepano/defuddle: Extract the main content from web pages.↩︎

  19. github.com GitHub - kepano/defuddle: Extract the main content from web pages.↩︎

  20. github.com GitHub - kepano/defuddle: Extract the main content from web pages.↩︎

  21. github.com GitHub - kepano/defuddle: Extract the main content from web pages.↩︎

  22. github.com GitHub - kepano/defuddle: Extract the main content from web pages.↩︎

  23. github.com GitHub - kepano/defuddle: Extract the main content from web pages.↩︎

  24. github.com GitHub - kepano/defuddle: Extract the main content from web pages.↩︎

  25. github.com GitHub - kepano/defuddle: Extract the main content from web pages.↩︎

  26. github.com GitHub - kepano/defuddle: Extract the main content from web pages.↩︎

  27. github.com GitHub - kepano/defuddle: Extract the main content from web pages.↩︎

  28. github.com GitHub - postlight/parser: Extract meaningful content from the chaos of a web page↩︎

  29. github.com GitHub - postlight/parser: Extract meaningful content from the chaos of a web page↩︎

  30. github.com GitHub - postlight/parser: Extract meaningful content from the chaos of a web page↩︎

  31. github.com GitHub - postlight/parser: Extract meaningful content from the chaos of a web page↩︎

  32. github.com GitHub - postlight/parser: Extract meaningful content from the chaos of a web page↩︎

  33. github.com GitHub - postlight/parser: Extract meaningful content from the chaos of a web page↩︎

  34. github.com GitHub - postlight/parser: Extract meaningful content from the chaos of a web page↩︎

  35. feedbin.com The Future of Full Content - Feedbin↩︎

  36. github.com buriy/python-readability: fast python port of arc90’s … - GitHub↩︎

  37. github.com goose3/README.rst at master · goose3/goose3 · GitHub↩︎

  38. reddit.com I forked Newspaper3k, fixed bugs and improved its article parsing …↩︎

  39. pypi.org newspaper4k - PyPI↩︎

  40. pypi.org newspaper4k - PyPI↩︎

  41. stackoverflow.com Cannot download article using newspaper3k - python - Stack Overflow↩︎

  42. github.com goose3/README.rst at master · goose3/goose3 · GitHub↩︎

  43. github.com Releases · goose3/goose3 - GitHub↩︎

  44. github.com goose3/README.rst at master · goose3/goose3 · GitHub↩︎

  45. goose3.readthedocs.io Quickstart — goose3 3.1.19 documentation↩︎

  46. goose3.readthedocs.io Quickstart — goose3 3.1.19 documentation↩︎

  47. goose3.readthedocs.io Quickstart — goose3 3.1.19 documentation↩︎

  48. goose3.readthedocs.io Quickstart — goose3 3.1.19 documentation↩︎

  49. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  50. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  51. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  52. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  53. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  54. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  55. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  56. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  57. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  58. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  59. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  60. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  61. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  62. dev.to Deep Dive into Microsoft MarkItDown - DEV Community↩︎

  63. dev.to Deep Dive into Microsoft MarkItDown - DEV Community↩︎

  64. dev.to Deep Dive into Microsoft MarkItDown - DEV Community↩︎

  65. dev.to Deep Dive into Microsoft MarkItDown - DEV Community↩︎

  66. github.com GitHub - microsoft/markitdown: Python tool for converting files and office documents to Markdown.↩︎

  67. github.com GitHub - microsoft/markitdown: Python tool for converting files and office documents to Markdown.↩︎

  68. appadvice.com Instapaper launches a new Instaparser API for developers↩︎

  69. appadvice.com Instapaper launches a new Instaparser API for developers↩︎

  70. instaparser.com Our Pricing Plans - Instaparser↩︎

  71. appadvice.com Instapaper launches a new Instaparser API for developers↩︎

  72. apify.com Source code · Article Text Extractor - Apify↩︎

  73. apify.com Article Content Extractor · Apify↩︎

  74. apify.com Article Text Extractor API in JavaScript - Apify↩︎

  75. apify.com Article Text Extractor API in JavaScript - Apify↩︎

  76. apify.com Article Content Extractor · Apify↩︎

  77. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  78. github.com GitHub - kepano/defuddle: Extract the main content from web pages.↩︎

  79. github.com GitHub - postlight/parser: Extract meaningful content from the chaos of a web page↩︎

  80. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  81. appadvice.com Instapaper launches a new Instaparser API for developers↩︎

  82. github.com GitHub - adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML↩︎

  83. appadvice.com Instapaper launches a new Instaparser API for developers↩︎