ScrapeGraph
This notebook provides a quick overview for getting started with ScrapeGraph tools. For detailed documentation of all ScrapeGraph features and configurations head to the API reference.
For more information about ScrapeGraph AI:
Overview
Integration details
| Class | Package | Serializable | JS support | Package latest | 
|---|---|---|---|---|
| SmartScraperTool | langchain-scrapegraph | ✅ | ❌ | |
| MarkdownifyTool | langchain-scrapegraph | ✅ | ❌ | |
| LocalScraperTool | langchain-scrapegraph | ✅ | ❌ | |
| GetCreditsTool | langchain-scrapegraph | ✅ | ❌ | 
Tool features
| Tool | Purpose | Input | Output | 
|---|---|---|---|
| SmartScraperTool | Extract structured data from websites | URL + prompt | JSON | 
| MarkdownifyTool | Convert webpages to markdown | URL | Markdown text | 
| LocalScraperTool | Extract data from HTML content | HTML + prompt | JSON | 
| GetCreditsTool | Check API credits | None | Credit info | 
Setup
The integration requires the following packages:
%pip install --quiet -U langchain-scrapegraph
Note: you may need to restart the kernel to use updated packages.
Credentials
You'll need a ScrapeGraph AI API key to use these tools. Get one at scrapegraphai.com.
import getpass
import os
if not os.environ.get("SGAI_API_KEY"):
    os.environ["SGAI_API_KEY"] = getpass.getpass("ScrapeGraph AI API key:\n")
It's also helpful (but not needed) to set up LangSmith for best-in-class observability:
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()
Instantiation
Here we show how to instantiate instances of the ScrapeGraph tools:
from langchain_scrapegraph.tools import (
    GetCreditsTool,
    LocalScraperTool,
    MarkdownifyTool,
    SmartScraperTool,
)
smartscraper = SmartScraperTool()
markdownify = MarkdownifyTool()
localscraper = LocalScraperTool()
credits = GetCreditsTool()
Invocation
Invoke directly with args
Let's try each tool individually:
# SmartScraper
result = smartscraper.invoke(
    {
        "user_prompt": "Extract the company name and description",
        "website_url": "https://scrapegraphai.com",
    }
)
print("SmartScraper Result:", result)
# Markdownify
markdown = markdownify.invoke({"website_url": "https://scrapegraphai.com"})
print("\nMarkdownify Result (first 200 chars):", markdown[:200])
local_html = """
<html>
    <body>
        <h1>Company Name</h1>
        <p>We are a technology company focused on AI solutions.</p>
        <div class="contact">
            <p>Email: contact@example.com</p>
            <p>Phone: (555) 123-4567</p>
        </div>
    </body>
</html>
"""
# LocalScraper
result_local = localscraper.invoke(
    {
        "user_prompt": "Make a summary of the webpage and extract the email and phone number",
        "website_html": local_html,
    }
)
print("LocalScraper Result:", result_local)
# Check credits
credits_info = credits.invoke({})
print("\nCredits Info:", credits_info)
SmartScraper Result: {'company_name': 'ScrapeGraphAI', 'description': "ScrapeGraphAI is a powerful AI web scraping tool that turns entire websites into clean, structured data through a simple API. It's designed to help developers and AI companies extract valuable data from websites efficiently and transform it into formats that are ready for use in LLM applications and data analysis."}
Markdownify Result (first 200 chars): [ScrapeGraphAI](https://scrapegraphai.com/)
PartnersPricingFAQ[Blog](https://scrapegraphai.com/blog)DocsLog inSign up
Op
LocalScraper Result: {'company_name': 'Company Name', 'description': 'We are a technology company focused on AI solutions.', 'contact': {'email': 'contact@example.com', 'phone': '(555) 123-4567'}}
Credits Info: {'remaining_credits': 49679, 'total_credits_used': 914}
Invoke with ToolCall
We can also invoke the tool with a model-generated ToolCall:
model_generated_tool_call = {
    "args": {
        "user_prompt": "Extract the main heading and description",
        "website_url": "https://scrapegraphai.com",
    },
    "id": "1",
    "name": smartscraper.name,
    "type": "tool_call",
}
smartscraper.invoke(model_generated_tool_call)
ToolMessage(content='{"main_heading": "Get the data you need from any website", "description": "Easily extract and gather information with just a few lines of code with a simple api. Turn websites into clean and usable structured data."}', name='SmartScraper', tool_call_id='1')
Chaining
Let's use our tools with an LLM to analyze a website:
pip install -qU "langchain[groq]"
import getpass
import os
if not os.environ.get("GROQ_API_KEY"):
  os.environ["GROQ_API_KEY"] = getpass.getpass("Enter API key for Groq: ")
from langchain.chat_models import init_chat_model
llm = init_chat_model("llama3-8b-8192", model_provider="groq")
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig, chain
prompt = ChatPromptTemplate(
    [
        (
            "system",
            "You are a helpful assistant that can use tools to extract structured information from websites.",
        ),
        ("human", "{user_input}"),
        ("placeholder", "{messages}"),
    ]
)
llm_with_tools = llm.bind_tools([smartscraper], tool_choice=smartscraper.name)
llm_chain = prompt | llm_with_tools
@chain
def tool_chain(user_input: str, config: RunnableConfig):
    input_ = {"user_input": user_input}
    ai_msg = llm_chain.invoke(input_, config=config)
    tool_msgs = smartscraper.batch(ai_msg.tool_calls, config=config)
    return llm_chain.invoke({**input_, "messages": [ai_msg, *tool_msgs]}, config=config)
tool_chain.invoke(
    "What does ScrapeGraph AI do? Extract this information from their website https://scrapegraphai.com"
)
AIMessage(content='ScrapeGraph AI is an AI-powered web scraping tool that efficiently extracts and converts website data into structured formats via a simple API. It caters to developers, data scientists, and AI researchers, offering features like easy integration, support for dynamic content, and scalability for large projects. It supports various website types, including business, e-commerce, and educational sites. Contact: contact@scrapegraphai.com.', additional_kwargs={'tool_calls': [{'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'function': {'arguments': '{"user_prompt":"Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.","website_url":"https://scrapegraphai.com"}', 'name': 'SmartScraper'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 47, 'prompt_tokens': 480, 'total_tokens': 527, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_c7ca0ebaca', 'finish_reason': 'stop', 'logprobs': None}, id='run-45a12c86-d499-4273-8c59-0db926799bc7-0', tool_calls=[{'name': 'SmartScraper', 'args': {'user_prompt': 'Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.', 'website_url': 'https://scrapegraphai.com'}, 'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'type': 'tool_call'}], usage_metadata={'input_tokens': 480, 'output_tokens': 47, 'total_tokens': 527, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})
API reference
For detailed documentation of all ScrapeGraph features and configurations head to the Langchain API reference: https://python.langchain.com/docs/integrations/tools/scrapegraph
Or to the official SDK repo: https://github.com/ScrapeGraphAI/langchain-scrapegraph
Related
- Tool conceptual guide
- Tool how-to guides