Types of File Formats in Hadoop with Examples

Hadoop supports various file formats to store data efficiently, ensuring optimized storage and processing. Here's an expanded explanation of each type with examples:

  1. TextFile:

    • Description: This is the basic, plain-text storage format, where each line in the file represents a record and fields are separated by a delimiter such as a comma or tab.
    • Example:
      Anamika Singh,Bangalore,35
      Shekhar,Delhi,30
  2. SequenceFile:

    • Description: SequenceFile is a binary file format that stores key-value pairs. It's designed to be splittable and supports various compression codecs.
    • Example:
      key1  value1
      key2  value2
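
A short sketch of writing and reading the key-value pairs above as a SequenceFile through the PySpark RDD API; the output path is an assumption:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SequenceFileExample").getOrCreate()
    sc = spark.sparkContext

    # Keys and values are serialized as Hadoop Writables under the hood.
    pairs = sc.parallelize([("key1", "value1"), ("key2", "value2")])
    pairs.saveAsSequenceFile("hdfs:///data/pairs.seq")  # hypothetical path

    # Read the SequenceFile back as an RDD of (key, value) tuples.
    restored = sc.sequenceFile("hdfs:///data/pairs.seq")
    print(restored.collect())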
  3. Avro:

    • Description: Avro is a row-based binary storage format that embeds its schema, defined in JSON, alongside the data. It is compact, splittable, and well suited to schema evolution.
    • Example: The data itself is binary, but the schema that travels with it is plain JSON, as sketched below.

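Spark reads Avro through the external spark-avro package; a hedged sketch, where the schema shown in the comment and the file path are illustrative assumptions:

    from pyspark.sql import SparkSession

    # A hypothetical Avro schema (people.avsc) matching the running example:
    # {
    #   "type": "record",
    #   "name": "Person",
    #   "fields": [
    #     {"name": "name", "type": "string"},
    #     {"name": "city", "type": "string"},
    #     {"name": "age",  "type": "int"}
    #   ]
    # }

    spark = SparkSession.builder.appName("AvroExample").getOrCreate()

    # Requires the external spark-avro package (org.apache.spark:spark-avro).
    df = spark.read.format("avro").load("hdfs:///data/people.avro")
    df.show()
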
  4. Parquet:

    • Description: Parquet is a columnar storage file format, optimized for performance and space efficiency. It supports complex nested data structures and provides efficient compression.
    • Example: The data in Parquet is binary, so it won't be readable like text formats.

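A minimal PySpark sketch of writing and reading Parquet; the path is an assumption for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ParquetExample").getOrCreate()

    df = spark.createDataFrame(
        [("Anamika Singh", "Bangalore", 35), ("Shekhar", "Delhi", 30)],
        ["name", "city", "age"],
    )

    # Columnar layout: a query that touches only "name" reads just that column.
    df.write.mode("overwrite").parquet("hdfs:///data/people.parquet")
    spark.read.parquet("hdfs:///data/people.parquet").select("name").show()
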
  5. ORC (Optimized Row Columnar):

    • Description: ORC is another columnar storage file format similar to Parquet. It's designed for high performance and offers efficient compression, making it suitable for large-scale data processing.
    • Example: The data in ORC is stored in a binary format.

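The same write/read pattern applies to ORC, which Spark supports natively; the path is again an assumption:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("OrcExample").getOrCreate()

    df = spark.createDataFrame(
        [("Anamika Singh", "Bangalore", 35), ("Shekhar", "Delhi", 30)],
        ["name", "city", "age"],
    )

    # ORC keeps per-stripe statistics that help skip irrelevant data at read time.
    df.write.mode("overwrite").orc("hdfs:///data/people.orc")
    spark.read.orc("hdfs:///data/people.orc").show()
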
  6. JSON:

    • Description: JSON (JavaScript Object Notation) is a human-readable text-based format for data interchange. It's widely used for data transmission between systems.
    • Example:
      {"name": "Anamika Singh", "age": 35, "city": "New Delhi"} {"name": "Ram", "age": 30, "city": "Bangalore"}

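By default Spark expects JSON Lines, i.e. one object per line as in the example above. A minimal sketch with an assumed path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("JsonExample").getOrCreate()

    # Spark infers the schema from the JSON keys (name, age, city).
    df = spark.read.json("hdfs:///data/people.json")  # hypothetical path
    df.printSchema()
    df.show()
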
  7. XML:

    • Description: XML (eXtensible Markup Language) is a markup language that encodes data in a text format. It's used for storing and transporting structured data.
    • Example:
      <person>
        <name>Anamika Singh</name>
        <age>35</age>
        <city>New Delhi</city>
      </person>

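Spark has no built-in XML reader; one common option is the external spark-xml package. A hedged sketch, assuming that package is on the classpath and using a hypothetical path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("XmlExample").getOrCreate()

    # Requires the external spark-xml package (com.databricks:spark-xml).
    df = (
        spark.read.format("xml")
        .option("rowTag", "person")  # each <person> element becomes one row
        .load("hdfs:///data/people.xml")
    )
    df.show()
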
  8. Hive RCFile (Record Columnar File):

    • Description: RCFile is a columnar storage file format developed for Apache Hive. It partitions rows horizontally into row groups and stores each group column-wise, which improves query performance and compression.
    • Example: The data is stored in a binary format optimized for Hive queries.

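A brief sketch creating an RCFile-backed Hive table from Spark SQL; the table and column names are hypothetical:

    from pyspark.sql import SparkSession

    # Hive support is required for tables STORED AS RCFILE.
    spark = (
        SparkSession.builder.appName("RcFileExample")
        .enableHiveSupport()
        .getOrCreate()
    )

    spark.sql("""
        CREATE TABLE IF NOT EXISTS people_rc (
            name STRING,
            city STRING,
            age  INT
        )
        STORED AS RCFILE
    """)
    spark.sql("INSERT INTO people_rc VALUES ('Anamika Singh', 'Bangalore', 35)")
    spark.sql("SELECT * FROM people_rc").show()
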
Each of these file formats has its advantages and is suitable for specific use cases. When choosing a file format in Hadoop, consider factors such as data size, query performance, schema flexibility, and compression requirements.


Hope you enjoyed reading about these file formats. If you liked this post, please Like, Comment and Share.


Thank You!

