Web Scraping: Store as JSON or in a Relational DB

Exploring NoSQL solutions (MongoDB and Elasticsearch) vs MySQL or MSSQL when dealing with semi-structured data from the web

Recently I wanted to store lots of images from the web, along with the associated data about those images, and access them later for Machine Learning. The web pages held unstructured, varying, or semi-structured data (tables, and divs with varying fields and values), which raises a few challenges.

Here’s the plan:

  1. Build a Web Scraper to collect the data from the pages
  2. Store the data, including images
  3. Access that data and start to build a Machine Learning algorithm

Today I’m Focusing on How to Best …
2. Store the data, including images and map the two against each other

Considerations

I needed to take data from sources that may have different fields and rows, and potentially keep all of it, because it may later reveal information about the photo. That makes storing the data as JSON objects look like the better option.

In this example, I don’t want to pay ongoing storage or compute fees to a cloud provider at this early stage, so I probably want a raw, self-managed solution; I can code and have some skill with technologies like this. That said, I do like cloud solutions for their ready-to-go, as-a-service features.

Normally I go to relational databases as my standard way to store data. But since I didn’t know in advance which data structures I would be working with, I thought this was a good opportunity to explore storing the data as JSON objects.

Questions

  • Q1: Store locally, or in a cloud somewhere – where are we going to store the photos and data
  • Q2: How are we going to store the photos and data
  • Q3: Developing the scraping tool to capture and store the info
  • Q4: Providing the data to a machine learning model later

Q2) Storing the Data as JSON Objects vs in a Relational Database

We can consider some advantages and disadvantages:

JSON advantages

  1. Flexible data: JSON stores data in a flexible, hierarchical format. It can accommodate complex and nested data structures, making it suitable for unstructured or semi-structured data.
  2. Flexible Development: JSON aligns well with modern programming languages and frameworks, and the data can be used in a variety of ways later.
  3. No Schema Constraints: Unlike relational databases that enforce a fixed schema, JSON allows for dynamic and evolving data structures. This enables adding or modifying fields without requiring schema alterations or data migrations.
  4. Performance: For certain use cases, especially those that read or write whole records at high speed, JSON can perform better. Retrieving and parsing a single JSON document can be faster than assembling the same record through complex join operations in a relational database.
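
To make the flexible-data point concrete, here is a minimal Python sketch. The records, file names and field names are made up for illustration; the only point is that two records with different, partly nested fields can be written out and read back without any schema change.

```python
import json

# Hypothetical scraped records: each source page exposed different fields.
records = [
    {"image": "img_001.jpg", "title": "Red barn", "tags": ["barn", "rural"]},
    {
        "image": "img_002.jpg",
        "caption": "Harbour at dusk",
        "camera": {"make": "Canon", "iso": 400},  # nested data is fine
    },
]

# Serialize each record to its own JSON string (one object per line).
lines = [json.dumps(record, sort_keys=True) for record in records]

# Reading the strings back restores the original structures.
restored = [json.loads(line) for line in lines]

# The set of fields is discovered from the data, not declared up front.
all_fields = sorted({key for record in restored for key in record})
print(all_fields)
```

Adding a new field later only means that newer records carry an extra key; older records stay valid as they are.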

JSON Object disadvantages

  1. Lack of Standardization: JSON does not have a strict schema definition, which can lead to data quality issues. Inconsistent data structures or missing fields may occur if there are no standardized rules for storing and accessing JSON data.
  2. Limited Query Capabilities: Relational databases excel at complex querying and aggregation using SQL. JSON stored as plain text, on the other hand, typically supports only simple key-value lookups unless the store adds its own query layer, making it less suitable for advanced query operations.
  3. Data Integrity Challenges: Relational databases enforce data integrity through referential integrity constraints and ACID (Atomicity, Consistency, Isolation, Durability) transactions. JSON, being schema-less, lacks these built-in mechanisms, making it more challenging to ensure data consistency and integrity.
  4. Storage Efficiency: In some cases, storing data as JSON can result in higher storage requirements compared to a well-designed relational database. JSON objects include field names with each record, which can increase data size, especially when dealing with large datasets.
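
One low-cost way to soften the standardization problem is to check each record against a small set of required fields before storing it. This is only a sketch: the required fields (`image`, `source_url`) and the sample records are assumptions for illustration, not something the scraper defines yet.

```python
# Hypothetical minimum every stored record should carry.
REQUIRED_FIELDS = {"image", "source_url"}


def missing_fields(record: dict) -> set:
    """Return the required field names that are absent from the record."""
    return REQUIRED_FIELDS - set(record)


# Made-up sample records; the second one lacks a source URL.
records = [
    {"image": "img_001.jpg", "source_url": "https://example.com/a", "title": "Red barn"},
    {"image": "img_002.jpg", "caption": "Harbour at dusk"},
]

# Split records into those safe to store and those needing attention.
valid = [r for r in records if not missing_fields(r)]
rejected = [(r["image"], missing_fields(r)) for r in records if missing_fields(r)]

print(len(valid), "valid;", rejected)
```

Everything beyond the required fields stays free-form, so the flexibility that motivated JSON in the first place is kept.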

I’ve Decided to Use JSON Objects. What Next?

So I’m going with JSON objects. Let’s look at some options for storing JSON – NoSQL/Document Storage

MongoDB

MongoDB Atlas

  • Google Cloud Platform (GCP): Firestore, Google’s managed NoSQL document database, stores data as JSON-like documents grouped into collections, so it covers similar ground to MongoDB
    • This also opens the possibility of using Python and exploring Google Cloud Platform and its services, so I explored this as an option too. Python is good, but I am not as familiar with it, and it apparently has issues with memory and speed when reading and working with large JSON objects [provide a source/reference to this please]
  • https://community.playfab.com/questions/38131/best-way-to-store-large-amount-of-json-objects.html – Elasticsearch
  • MongoDB Atlas
  • [List the other solutions that would be good]
  • Considering current traction:
    • I already have a hosting account at InterServer, which supports technologies like ASP.NET and PHP, and MSSQL and MySQL databases
    • With that account I have unlimited storage, traffic and data, so it would be good to use it rather than setting up another solution, like the cloud platforms above, that is likely to be a service with ongoing charges. This is an experimental app, which I can potentially use as a basis for further development later on
    • With this considered, do InterServer or any of the other shared hosting solutions I currently have set up, like SiteGround, provide NoSQL facilities like MongoDB as part of their offering? The answer is no. [List some of the other hosting providers that do offer mongo db hosting, but talk about the cost of doing this]
    • After all this, and after looking into ASP.NET and JSON storage, I landed on the option of storing JSON in table cells in MSSQL and using LINQ to JSON (part of the Json.NET library) to handle JSON queries. This could be a good option for me: it lets me query JSON data relatively well without needing a NoSQL database, which solves my issue of dealing with semi-structured or evolving data. There are other ways to do this in .NET, such as System.Text.Json or SQL Server’s built-in JSON functions, but I settled on LINQ to JSON because I already have experience with it, which reduces learning-curve uncertainty and keeps things as simple as possible
    • With System.Text.Json, for example, methods like JsonDocument.Parse() parse the JSON string, and the parsed data can then be queried using LINQ
  • Photo Data Store – InterServer
  • Data Store – InterServer SQL Server
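
The "JSON in a table cell" idea can be sketched independently of .NET. The example below is a rough Python illustration, not the MSSQL and LINQ to JSON setup itself: it uses the standard-library `sqlite3` module with an in-memory database as a stand-in for SQL Server, and plain list comprehensions as a stand-in for LINQ queries. The table name, column names, paths and records are all hypothetical.

```python
import json
import sqlite3

# In-memory SQLite database standing in for the hosted SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_record ("
    " id INTEGER PRIMARY KEY,"
    " image_path TEXT NOT NULL UNIQUE,"  # where the photo file lives
    " metadata TEXT NOT NULL)"           # the scraped data, as a JSON string
)

# Made-up scraped data: every record has a different shape.
scraped = [
    ("photos/img_001.jpg", {"title": "Red barn", "tags": ["barn", "rural"]}),
    ("photos/img_002.jpg", {"caption": "Harbour at dusk", "camera": {"iso": 400}}),
    ("photos/img_003.jpg", {"title": "Old mill", "camera": {"iso": 100}}),
]
conn.executemany(
    "INSERT INTO image_record (image_path, metadata) VALUES (?, ?)",
    [(path, json.dumps(meta)) for path, meta in scraped],
)


def load_records(connection):
    """Read every row and parse its JSON cell back into a dict."""
    rows = connection.execute(
        "SELECT image_path, metadata FROM image_record ORDER BY id"
    )
    return [(path, json.loads(meta)) for path, meta in rows]


records = load_records(conn)

# Query the parsed JSON in application code, much as LINQ to JSON would.
with_camera = [path for path, meta in records if "camera" in meta]
low_iso = [
    path
    for path, meta in records
    if meta.get("camera", {}).get("iso", float("inf")) <= 200
]
print(with_camera, low_iso)
```

The relational table only fixes the two things that are always known (the photo path and the fact that there is some metadata), while the JSON cell absorbs whatever fields each page happened to provide. The trade-off is that filtering happens in application code after loading the rows, which is fine for an experiment but may need rethinking for large datasets.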