File organization

Created time	@August 29, 2023 4:27 PM
Tags	A2C16Notes

Def: Refers to the way data is stored in a file.

Text file vs binary file

Text file: Data is stored in a file as a sequence of characters.

Binary file: The binary equivalent of the data is stored as a sequence of bytes.
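As a small illustration of the difference, the sketch below stores the same integer both ways. The value 1234567 and the 4-byte little-endian layout are assumptions chosen for this example only.

```python
import struct

value = 1234567

# Text form: the number is stored as the characters '1', '2', '3', ...
as_text = str(value).encode("ascii")

# Binary form: the number is stored as its binary equivalent,
# here assumed to be a 4-byte little-endian signed integer.
as_binary = struct.pack("<i", value)

print(len(as_text))    # one byte per character
print(len(as_binary))  # fixed size, however many digits the number has

# Reading the data back needs the matching interpretation of the bytes.
text_value = int(as_text.decode("ascii"))
binary_value = struct.unpack("<i", as_binary)[0]
```

Both forms hold the same value, but only the text form can be read sensibly in a text editor.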

Record stored in a text file

The number of data items per line must be known.

The number of characters per item must be known; otherwise, item separator characters must be used.

Repeating lines are delimited by an end-of-line character or characters.
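The rules above can be sketched as follows. The comma separator, the three items per line and the sample records are assumptions for illustration, and `io.StringIO` stands in for a text file on disk.

```python
import io

SEPARATOR = ","   # assumed item separator character
FIELDS = 3        # assumed number of data items per line

def write_record(f, items):
    """Write one record as one line, with items joined by the separator."""
    if len(items) != FIELDS:
        raise ValueError("wrong number of items for this file")
    f.write(SEPARATOR.join(items) + "\n")  # "\n" is the end-of-line character

def read_records(f):
    """Return each line of the file as a list of its items."""
    return [line.rstrip("\n").split(SEPARATOR) for line in f]

f = io.StringIO()  # stands in for a text file opened on disk
write_record(f, ["1001", "Ann", "72"])
write_record(f, ["1002", "Bob", "65"])
f.seek(0)
records = read_records(f)
print(records)
```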

Direct access

Random access files

Serial files:

The records are put into the file chronologically, i.e. according to when each record is produced. The one produced first is put into the file first, then the next one, and so on.

Each new record is simply appended to the file, so that the only ordering in the file is the time order of data entry.

Typical example:

A bank recording transactions involving customer accounts as they happen.
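A minimal sketch of a serial file, assuming one transaction per line in a text file. The file name and the record layout are made up for the example.

```python
import os
import tempfile

def append_record(path, record):
    """Append one record to the end of the serial file."""
    with open(path, "a", encoding="utf-8") as f:  # "a" = append mode
        f.write(record + "\n")

def read_all(path):
    """Read the records back in the order they were written."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

# Hypothetical transaction log in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "transactions.txt")
append_record(path, "acc=205,deposit,50")
append_record(path, "acc=117,withdraw,20")
append_record(path, "acc=205,withdraw,10")

transactions = read_all(path)
print(transactions)
```

The account numbers are not in any order; the only ordering is the order of entry.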

Sequential files

A sequential file has records that are ordered.

In order to allow a sequential file to be ordered, there has to be a key for which the values are unique and sequential but not necessarily consecutive.

What needs to be done when a new record is to be added?

The file is read sequentially and each record is written to a new file.

This continues until the appropriate position for the new record is reached.

The new record is then written to the new file, before the remaining records in the old file are copied across.
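The steps above can be sketched as below. The record layout ("key,rest of record" on each line), the integer keys and the file names are assumptions for the example.

```python
import os
import tempfile

def key_of(line):
    """Assumed record layout: 'key,rest of record'."""
    return int(line.split(",", 1)[0])

def insert_record(old_path, new_path, new_record):
    """Copy the old file to a new file, adding new_record at its key position."""
    new_key = key_of(new_record)
    inserted = False
    with open(old_path, encoding="utf-8") as old:
        with open(new_path, "w", encoding="utf-8") as new:
            for line in old:
                # The first record with a larger key marks the insert position.
                if not inserted and key_of(line) > new_key:
                    new.write(new_record + "\n")
                    inserted = True
                new.write(line)  # copy the existing record across
            if not inserted:
                # The new key is the largest, so the record goes at the end.
                new.write(new_record + "\n")

folder = tempfile.mkdtemp()
old_path = os.path.join(folder, "old.txt")
new_path = os.path.join(folder, "new.txt")
with open(old_path, "w", encoding="utf-8") as f:
    f.write("10,Ann\n20,Bob\n40,Dee\n")

insert_record(old_path, new_path, "30,Cy")
with open(new_path, encoding="utf-8") as f:
    result = f.read().splitlines()
print(result)
```

Note that the keys 10, 20, 30, 40 are sequential but not consecutive.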

Direct-access files

Random-access files

Access can be to any record in the file without sequential reading of the file
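One way to picture this is with fixed-length records, where the position of a record can be calculated from its address. This is only a sketch: the record layout (a 4-byte integer key plus a 10-byte name) is an assumption, and `io.BytesIO` stands in for a binary file on disk.

```python
import io
import struct

# Assumed fixed-length layout: 4-byte integer key + 10-byte name.
RECORD = struct.Struct("<i10s")

def write_at(f, address, key, name):
    """Write a record directly at the given record address."""
    f.seek(address * RECORD.size)
    f.write(RECORD.pack(key, name.encode("ascii")))

def read_at(f, address):
    """Read the record at the given address without reading earlier ones."""
    f.seek(address * RECORD.size)
    key, name = RECORD.unpack(f.read(RECORD.size))
    return key, name.rstrip(b"\x00").decode("ascii")

f = io.BytesIO(bytes(RECORD.size * 5))  # empty file with room for 5 records
write_at(f, 3, 42, "Dee")
write_at(f, 0, 7, "Ann")

record = read_at(f, 3)
print(record)
```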

Direct access can be achieved with a sequential file.

A separate index file is created, which has two fields per record.

The first field holds the key field value and the second field holds the position of the record with that key in the main file.
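A rough sketch of the index idea. To keep it short, the index is held in a Python dictionary rather than a separate file, the position is taken to be a byte offset, and the main file is an in-memory `io.BytesIO`; all three are simplifying assumptions.

```python
import io

def build_index(main_file):
    """Return {key: byte offset} for a main file of 'key,data' lines."""
    index = {}
    main_file.seek(0)
    while True:
        position = main_file.tell()   # where the next record starts
        line = main_file.readline()
        if not line:
            break                     # end of file
        key = int(line.split(b",", 1)[0].decode("ascii"))
        index[key] = position
    return index

def fetch(main_file, index, key):
    """Jump straight to the record for key using the index."""
    main_file.seek(index[key])
    return main_file.readline().rstrip(b"\n").decode("ascii")

main_file = io.BytesIO(b"10,Ann\n20,Bob\n30,Cy\n")
index = build_index(main_file)
found = fetch(main_file, index, 30)
print(index)
print(found)
```

Each index entry has the two fields described above: the key and the position of that record in the main file.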

The alternative is to use a hashing algorithm

What is a hashing algorithm?

One simple hashing algorithm

If there is a numeric key field in each record

Choose a suitable number

Divide the value in the key field by this number

The remainder from this division then identifies the address in the file for storage of that record

The suitable number works best if it is a prime number of a similar size to the expected size of the file
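The simple algorithm can be sketched in a few lines. The chosen number 11 (a prime, for a file expected to hold about 11 records) and the sample keys are assumptions for illustration.

```python
FILE_SIZE = 11  # the "suitable number": an assumed prime close to the file size

def hash_address(key):
    """Address = remainder when the key value is divided by the chosen number."""
    return key % FILE_SIZE

addresses = {key: hash_address(key) for key in (1005, 2311, 4480)}
print(addresses)
```

Because the address is a remainder, it always lies between 0 and FILE_SIZE - 1, whatever the key value.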

Collision: when the same address is calculated for different key field values, this is referred to as a collision.

Ways to handle collisions

Use a sequential search to look for a vacant address following the calculated one.

Keep a number of overflow addresses at the end of the file.

Have a linked list accessible from each address.
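A sketch of the first method only (sequential search for a vacant address). A Python list stands in for the file, and wrapping round to address 0 after the last address is an assumption of this sketch, as are the sample keys.

```python
FILE_SIZE = 11
table = [None] * FILE_SIZE  # a list stands in for the addresses in the file

def hash_address(key):
    return key % FILE_SIZE

def store(key, data):
    """Store at the hashed address, or at the next vacant address after it."""
    address = hash_address(key)
    for step in range(FILE_SIZE):
        slot = (address + step) % FILE_SIZE  # assumed: wrap round at the end
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, data)
            return slot
    raise RuntimeError("no vacant address left")

def find(key):
    """Search from the hashed address until the key or a vacant slot is met."""
    address = hash_address(key)
    for step in range(FILE_SIZE):
        slot = (address + step) % FILE_SIZE
        if table[slot] is None:
            return None               # vacant address: the key is not stored
        if table[slot][0] == key:
            return table[slot][1]
    return None

first = store(2310, "Ann")   # 2310 % 11 == 0
second = store(4477, "Bob")  # 4477 % 11 == 0 as well: a collision
print(first, second, find(4477))
```

The second record ends up one address after the calculated one, and the search follows the same path to find it again.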

Conclusion

Sequential: ordered, but not necessarily consecutive

Random: random (direct) access

Serial: in series, i.e. chronological order

ChatGPT:

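When we talk about file organization, we are mainly discussing how data is stored on and retrieved from a physical device. In computer systems there are two basic methods of file organization: serial and sequential.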

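In serial file organization, records are stored one after another with no particular ordering by key. This means that to find a particular record you must examine each record from the start of the file until you find the one you need. It is like looking for one particular sheet in an unsorted pile of paper. Serial file organization may therefore not be the most efficient choice for large data sets, because searching takes a long time.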

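In sequential file organization, records are stored in a logical order (usually key order). It is like an ordered folder in which every document is filed alphabetically. To find a particular record you can use this ordering to narrow the search, for example by stopping as soon as a larger key is reached. Sequential file organization is widely used because of this retrieval efficiency.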

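Overall, the main difference between serial and sequential file organization lies in how records are stored and retrieved. In a serial file the records are in no key order (only in order of arrival), whereas in a sequential file they are stored in order of a key.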

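Random file organization is another file organization strategy, also known as direct file organization or hashed file organization. In this scheme a record is stored directly at a particular physical location determined by its key value or by some transformation of its key value. This makes it possible to access any given record directly, without passing through other records.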

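The advantage of random file organization is fast data access. Because the storage location of each record is determined by its key value, the record can be located directly without scanning the whole file. This is especially useful for large data sets, since it can greatly reduce retrieval time.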

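However, random file organization also has some drawbacks. First, if the key values of two records map to the same location, a collision occurs and some method is needed to handle it. Second, if you do not know a record's key value, the record is hard to find. Finally, random file organization needs more complex algorithms and typically more storage space, for example for unused addresses and overflow areas.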

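Overall, random file organization is very useful where fast access to data is needed, but it requires more complex data structures and more storage space.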

Python dictionary structure

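In Python, a dictionary (dict) is an associative array implemented as a hash table, which stores data as key-value pairs. In terms of its internal organization, a Python dictionary is closest to random file organization.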

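This is because an item in a Python dictionary is located by the hash of its key rather than by its position in insertion order. When you access an item, Python uses the hash of the key to go directly to it, without traversing the whole dictionary. This makes key lookup in a dictionary very fast.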

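It is worth noting, however, that from Python 3.7 dictionaries preserve insertion order, which means that when you iterate over a dictionary the items are returned in the order in which they were added. This change came from an optimization of the dictionary implementation, but it does not change how a key is looked up. Internally, a Python dictionary still uses a hash table for efficient lookup, so it is still closest to random file organization.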
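A short sketch of both points, assuming Python 3.7 or later. The keys and values are made up for the example.

```python
stock = {}
stock["pear"] = 3
stock["apple"] = 5
stock["fig"] = 1

# Iteration returns the items in the order they were added.
order = list(stock)
print(order)

# Lookup by key goes through the hash of the key, not a scan of the items.
count = stock["apple"]
print(count)

# A key must be hashable, because its hash decides where to look.
try:
    stock[["not", "hashable"]] = 0
    list_key_allowed = True
except TypeError:
    list_key_allowed = False
print(list_key_allowed)
```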