
Web scraper using JSoup and Spring Boot

What is webscraping

Web scraping is a technique for extracting data from a website by parsing the HTML source of its pages, for example articles from news or book sites, product information from online shopping sites, or course information from education sites. Many organisations use web scrapers to improve their customers' experience, for example by extracting the price of a smartphone from multiple online stores and showing customers the URL of the cheapest offer.
Here we will learn how to code a web scraper by developing a simple news scraper service.

News scraper

A news scraper extracts news articles or other related content from a news site. Here we are going to create a web scraper application that pulls the articles from a news site.
Below are the operations provided by our news scraper service.
  1. List all the authors
  2. Search articles by author name
  3. Search articles by article title
  4. Search articles by article description
Below are the technologies we will use for the development.
  • Jsoup: Jsoup is a feature-rich API for manipulating HTML documents. We use it to parse the HTML document and search its tags and attributes to find the articles.
  • Java 8: Java 8 reduces the development effort with its lambdas and streams, which we will use for searching and other operations on the list of news articles.
  • Spring Boot: Spring Boot is a framework commonly used to develop microservices. With Spring Boot, developers can focus mainly on the business logic instead of on setting up the application and the deployment environment needed to run it.
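As a rough illustration of the Java 8 point above, the sketch below lists distinct authors and filters by author with streams. The class names StreamSearchSketch and SimpleArticle and the sample data are hypothetical stand-ins, not part of the application.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class StreamSearchSketch {

    // Hypothetical, trimmed-down stand-in for the Article DTO used in this post.
    static final class SimpleArticle {
        final String title;
        final String author;

        SimpleArticle(String title, String author) {
            this.title = title;
            this.author = author;
        }
    }

    // Distinct author names, in the order they are first seen.
    static List<String> authors(List<SimpleArticle> articles) {
        return articles.stream()
                .map(a -> a.author)
                .distinct()
                .collect(Collectors.toList());
    }

    // Titles of all articles written by the given author, ignoring case.
    static List<String> titlesBy(List<SimpleArticle> articles, String author) {
        return articles.stream()
                .filter(a -> a.author.equalsIgnoreCase(author))
                .map(a -> a.title)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        List<SimpleArticle> sample = Arrays.asList(
                new SimpleArticle("Budget session begins", "PTI"),
                new SimpleArticle("Court hears petition", "Legal Correspondent"),
                new SimpleArticle("Markets close higher", "PTI"));
        System.out.println(authors(sample));
        System.out.println(titlesBy(sample, "pti"));
    }
}
```

The same filter-then-collect shape is used for each search operation; only the predicate changes.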

Below is the structure of our application.

For our demo, we will use https://www.thehindu.com/archive. Our application will find all the article links on this page and then, for each link, extract the article details from its meta tags. Before starting the development, we need to inspect the website manually to identify the HTML tags and attributes it uses to display the articles; those are the tags and attributes we configure to get the article details.

Maven Dependencies

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
          <groupId>org.jsoup</groupId>
          <artifactId>jsoup</artifactId>
          <version>1.10.2</version>
        </dependency>

application.properties

Here we configure the news site URL and the meta tags from which we extract the article details. Below is the content of this file.
server.port= 8081
#news site URL
newspaper.thehindu.url=https://www.thehindu.com/archive
#time out if a page cannot be fetched and parsed within 10 seconds
newspaper.thehindu.parse.timeout.ms=10000
#meta tag name for author name
newspaper.thehindu.article.authortag=meta[property=article:author]
#meta tag name for title
newspaper.thehindu.article.titletag=meta[name=title]
#meta tag name for description
newspaper.thehindu.article.desctag=meta[name=description]
# article search tags
newspaper.thehindu.article.searchtags=div[class=tc1-slide],div[class=justin-text-cont]
#logging configuration
logging.level.org.springframework=ERROR
logging.level.com.ttj=DEBUG
logging.file=${java.io.tmpdir}/web-scraper.log
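The searchtags entry holds several selectors in one comma-separated value. As a minimal plain-Java sketch of how such a value can be read and split (the class SearchTagsSketch and its methods are hypothetical; the application itself relies on Spring to inject the values):

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

final class SearchTagsSketch {

    // Parses properties from text, as they would appear in application.properties.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Splits a comma-separated property value into individual selectors.
    static List<String> selectors(Properties props, String key) {
        String value = props.getProperty(key, "");
        if (value.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(value.split(","));
    }

    public static void main(String[] args) {
        Properties props = load(
                "newspaper.thehindu.article.searchtags=div[class=tc1-slide],div[class=justin-text-cont]\n");
        selectors(props, "newspaper.thehindu.article.searchtags").forEach(System.out::println);
    }
}
```

Because the comma is the separator, a selector that itself contains a comma cannot be expressed in this property.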

WebScraperEndpoint.java

This class is our REST service which exposes all required endpoints.
@RestController
@RequestMapping("/articles")
public class WebScraperEndpoint {
    
    @Autowired
    WebScraperService scraperService;
    
    //Search articles by author name
    @RequestMapping(value="/by-author/{authorName}", method = RequestMethod.GET, produces = "application/json")
    public List<Article> searchArticlesByAuthor(@PathVariable("authorName") String authorName) {
        return scraperService.searchArticlesByAuthor(authorName);
    }

    //List all the authors
    @RequestMapping(value="/authors", method = RequestMethod.GET, produces = "application/json")
    public List<String> listAuthors() {
        return scraperService.listAuthors();
    }

    //Search articles by title
    @RequestMapping(value="/by-title/{title}", method = RequestMethod.GET, produces = "application/json")
    public List<Article> searchArticleByTitle(@PathVariable("title") String title) {
        return scraperService.searchArticleByTitle(title);
    }

    //search articles by description
    @RequestMapping(value="/by-desc/{desc}", method = RequestMethod.GET, produces = "application/json")
    public List<Article> searchArticleByDescription(@PathVariable("desc") String desc) {
        return scraperService.searchArticleByDescription(desc);
    }
}

WebScraperService.java

This interface defines the operations that the scraper service must implement.
public interface WebScraperService {
    public void loadContents() throws MalformedURLException, IOException;
    public List<String> listAuthors();
    public List<Article> searchArticlesByAuthor(String authorName);
    public List<Article> searchArticleByTitle(String title);
    public List<Article> searchArticleByDescription(String desc);
}

WebScraperServiceImpl.java

This service class loads the article details from the news site using the Spring bean lifecycle (@PostConstruct), so the contents are loaded only once, when the application starts. This is not a realistic scenario, where you may need to reload the contents periodically. The class also provides the business logic to list and search the articles.
@Service
public class WebScraperServiceImpl implements WebScraperService{
    private final Logger LOGGER = LoggerFactory.getLogger(this.getClass());
    
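    // Written by the async loader thread and read by request threads, so use a
    // thread-safe list (java.util.concurrent.CopyOnWriteArrayList).
    private final List<Article> articles = new CopyOnWriteArrayList<>();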
    
    @Value("${newspaper.thehindu.url}")
    private String newspaperUrl;
    @Value("${newspaper.thehindu.parse.timeout.ms}")
    Integer parseTimeoutMillis;
    @Value("${newspaper.thehindu.article.authortag}")
    String authorTagName;
    @Value("${newspaper.thehindu.article.titletag}")
    String titleTagName;
    @Value("${newspaper.thehindu.article.desctag}")
    String descTagName;

    @Value("#{'${newspaper.thehindu.article.searchtags}'.split(',')}")
    List<String> articleLinksSearchTags;
    
    public WebScraperServiceImpl() {
    }
    
    @PostConstruct
    @Override
    public void loadContents() throws IOException {
        LOGGER.info("loadContents()...start");
        articles.clear();
        List<String> articleDetailsSearchTags = Arrays.asList(authorTagName, titleTagName, descTagName);
        WebScraperHelper scraperHelper = new WebScraperHelper(newspaperUrl, parseTimeoutMillis, articleDetailsSearchTags, articleLinksSearchTags);

        LOGGER.info("Extracting article details...start");
                        
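        // The fetch runs asynchronously; articles are added once it finishes.
        scraperHelper.fetchAllLinkMetaDetailsFromPage()
        .thenAccept(list->{
            // Skip failed fetches (null) and pages without an author meta tag.
            list.stream().filter(map->map!=null && map.get(authorTagName)!=null && map.get(authorTagName).length()>0)
            .forEach(map->{
                articles.add(new Article(map.get(titleTagName), map.get(descTagName), map.get(authorTagName)));
            });
            LOGGER.info("Extracting article details...completed, {} articles loaded", articles.size());
        });
        
        LOGGER.info("loadContents()...fetch started in background");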
    }
    
    @Override
    public List<String> listAuthors() {
        return articles.stream().map(a->a.getAuthorName())
                .distinct()
                .collect(Collectors.toList());
    }

    @Override
    public List<Article> searchArticlesByAuthor(String authorName) {
        return articles.stream().filter(a->a.getAuthorName().equalsIgnoreCase(authorName))
                .collect(Collectors.toList());
    }

    @Override
    public List<Article> searchArticleByTitle(String title) {
        return articles.stream().filter(a->a.getTitle().startsWith(title))
                .collect(Collectors.toList());
    }

    @Override
    public List<Article> searchArticleByDescription(String desc) {
        return articles.stream().filter(a->a.getDescription().startsWith(desc))
                .collect(Collectors.toList());
    }
}
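The service above loads the articles once at startup. One possible way to reload them periodically is sketched below with a plain ScheduledExecutorService and an atomically swapped snapshot. PeriodicReloadSketch and its loader parameter are hypothetical, and strings stand in for articles.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class PeriodicReloadSketch {

    // Readers always see a complete list; the reference is replaced in one step.
    private volatile List<String> snapshot = Collections.emptyList();
    private final Supplier<List<String>> loader;

    PeriodicReloadSketch(Supplier<List<String>> loader) {
        this.loader = loader;
    }

    // Loads a fresh list; if loading fails, the previous snapshot is kept.
    void reload() {
        try {
            snapshot = Collections.unmodifiableList(new ArrayList<>(loader.get()));
        } catch (RuntimeException e) {
            // Assumption: serving stale data is preferable to serving nothing.
        }
    }

    List<String> current() {
        return snapshot;
    }

    // Runs reload() immediately and then again after each period.
    // The caller should shut the returned scheduler down on application exit.
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::reload, 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

In a Spring Boot application, a method annotated with @Scheduled (with @EnableScheduling on a configuration class) is the more usual way to trigger such a reload; the snapshot-swap idea stays the same.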

WebScraperHelper.java

This helper class uses the Jsoup API to read the article links and to parse the HTML tags of each article page.
public class WebScraperHelper {
    private final Logger LOGGER = LoggerFactory.getLogger(this.getClass());
    
    private String pageUrl;
    private Integer pageParseTimeoutMillis;
    private List<String> detailsSearchTag;
    private List<String> linksSearchTags;
    
    public WebScraperHelper(String pageUrl, Integer pageParseTimeoutMillis, List<String> detailsSearchTag,
            List<String> linksSearchTags) {
        super();
        this.pageUrl = pageUrl;
        this.pageParseTimeoutMillis = pageParseTimeoutMillis;
        this.detailsSearchTag = detailsSearchTag;
        this.linksSearchTags = linksSearchTags;
    }
    /**
     * This method uses main page url supplied in constructor and retrieves all the links from that page
     * which are coming under the tags expression supplied as links search tags and then fetches all the meta details for those pages
     * @return : returns a list of all articles with the details fetched using the links search tag supplied in constructor.
     */
    public CompletableFuture<List<Map<String, String>>> fetchAllLinkMetaDetailsFromPage(){
        List<Map<String, String>> metaDetailsList = new ArrayList<>();
        return CompletableFuture.supplyAsync(()->{
            try {
                return getAllLinksFromPage();
            } catch (IOException e) {
                LOGGER.error("Error in getting links.", e);
                // Return an empty set (java.util.Collections) so the next stage does not fail on null.
                return Collections.<String>emptySet();
            }
        }).thenApplyAsync(links->{
            links.forEach(lnk->{
                CompletableFuture<Map<String, String>> detailsFuture = CompletableFuture.supplyAsync(()->{
                    try {
                        return getLinkDetails(lnk);
                    } catch (IOException e) {
                        LOGGER.error("Error in getting link details.", e);
                        // An empty map has no author entry, so the caller filters it out.
                        return Collections.<String, String>emptyMap();
                    }
                });
                try {
                    // Note: waiting here means the links are fetched one after another.
                    metaDetailsList.add(detailsFuture.get());
                } catch (InterruptedException | ExecutionException e) {
                    LOGGER.error("Error in extracting results after task completion.", e);
                }
            });
            return metaDetailsList;
        }).toCompletableFuture();
    }
    /**
     * Extracts article details from meta tag using the detailsSearchTag supplied in constructor.
     * @param url
     * @return
     * @throws IOException
     */
    private Map<String, String> getLinkDetails(String url) throws IOException{
        Document doc = Jsoup.parse(new URL(url), pageParseTimeoutMillis);
        Map<String, String> tagDetails = new HashMap<>();
        detailsSearchTag.forEach(tag->{
            // Elements.attr returns "" when nothing matches, so a missing meta tag does not throw.
            tagDetails.put(tag, doc.select(tag).attr("content"));
        });
        return tagDetails;
    }
    /**
     * Fetches all the links from the page which matches the criteria for linksSearchTags supplied in constructor
     * @return
     * @throws IOException
     */
    private Set<String> getAllLinksFromPage() throws IOException {
        Document doc = Jsoup.parse(new URL(pageUrl), pageParseTimeoutMillis);
        return searchLinkTags(doc, linksSearchTags);
    }
    
    /**
     * Extracts the actual link from a tag
     * @param doc
     * @param searchTags
     * @return
     */
    private Set<String> searchLinkTags(Document doc, List<String> searchTags){
        Set<String> links = new HashSet<>();
        searchTags.forEach(tag->{
            Elements elems = doc.select(tag);
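            elems.forEach(e->{
                // abs:href resolves relative links against the page URL; skip containers without a link.
                String href = e.select("a[href]").attr("abs:href");
                if (!href.isEmpty()) {
                    links.add(href);
                }
            });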
        });
        return links;
    }
}
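The helper above waits for each link's future inside the loop, so the pages are effectively fetched one at a time. A minimal sketch of starting all tasks first and combining them with CompletableFuture.allOf is shown below. ParallelFetchSketch and the fetcher parameter are hypothetical, with the fetcher standing in for a method such as getLinkDetails.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;

final class ParallelFetchSketch {

    // Starts one async task per link, then gathers the results in link order.
    // If any task fails, the returned future completes exceptionally.
    static <T> CompletableFuture<List<T>> fetchAll(List<String> links, Function<String, T> fetcher) {
        // Start every task before waiting on any of them, so they can overlap.
        List<CompletableFuture<T>> futures = links.stream()
                .map(link -> CompletableFuture.supplyAsync(() -> fetcher.apply(link)))
                .collect(Collectors.toList());

        // allOf completes when all tasks are done; join() then returns without blocking.
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(done -> futures.stream()
                        .map(CompletableFuture::join)
                        .collect(Collectors.toList()));
    }
}
```

The tasks run on the common ForkJoinPool by default; for blocking network calls, passing a dedicated executor to supplyAsync is usually a better fit.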

Article.java

This is our DTO class which contains required properties and their getter/setter methods.
public class Article implements Serializable{
    private static final long serialVersionUID = 1L;

    private String title;
    private String description;
    private String authorName;
    
    public Article() {}
    public Article(String title, String description, String authorName) {
        super();
        this.title = title;
        this.description = description;
        this.authorName = authorName;
    }
    //getter methods
    //setter methods
}

Now our application is ready to run. Start the application and open the URL below in a browser. You will get output similar to the following. It lists all the authors that were found; any of these names can be used with the by-author endpoint to search articles by author name.
  URL: http://localhost:8081/articles/authors
  Output:
 ["Suhrith Parthasarathy & Gautam Bhatia","Legal Correspondent","Surendra","PTI","The Hindu Net Desk","AP","Reuters","Yuthika Bhargava","Ramya Kannan","Pradeep Kumar","Staff Reporter","Special Correspondent","Krishnadas Rajagopal","N.J. Nair","AFP"]
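Author names such as "Staff Reporter" contain spaces, so they need encoding before they go into the by-author path. A small sketch of building that URL follows. EndpointUrlSketch is a hypothetical helper, and the base URL assumes the port configured earlier.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class EndpointUrlSketch {

    // Builds the by-author URL with the author name encoded as a path segment.
    static String byAuthorUrl(String baseUrl, String authorName) {
        try {
            // URLEncoder targets form encoding and writes spaces as '+';
            // inside a path segment a space should be %20 instead.
            String segment = URLEncoder.encode(authorName, "UTF-8").replace("+", "%20");
            return baseUrl + "/articles/by-author/" + segment;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(byAuthorUrl("http://localhost:8081", "Staff Reporter"));
    }
}
```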
If you configure Swagger UI, you will see all the available endpoints, as in the screenshot below.
[Screenshot: webscraper]

The full source code is available at the Git URL below.
https://github.com/thetechnojournals/misc_codes/tree/master/web-scraper
