DBENGINE: pgc tuning, replication tuning (#19237)
* evict one page at a time
* 4 replication ahead requests per replication thread
* added per-job average timings for workers and the dbengine query router
* debug statement to find what is slow
* yield the processor to avoid monopolizing the cache
* test more page sizes in aral
* more polite journal v2 indexing
* pulse macros for atomics
* added profile so that users can control the defaults of the agent
* fix Windows warnings; journal v2 generation yields the processor for every page
* replication threads are 1/3 of the cores and they are synchronous
* removed the default from the list of profiles
* turn pgc locks into macros, so the functions that call them can be traced
* log the size passed to madvise() when it fails
* more work on profiles
* restore batch cache evictions, but lower the batch size significantly
* do not spin while searching for pages in the cache - handle pages that are currently being deleted within the search logic itself
* remove bottleneck in epdl processing while merging extents
* allocate outside the lock
* rw spinlock implemented without spinlocks; both spinlocks and r/w spinlocks now support exponential backoff while waiting (see the backoff sketch after this list)
* apply max sleep to spinlocks
* tune replication
* r/w spinlock prefers writers again, but this time recursive readers bypass the writer wait
* tuning of rw spinlock
* more tuning of the rw spinlock
* configure glibc arenas based on profile
* moving global variables into nd_profile
* do not accept sockets that have not received any data; once sockets with data have been accepted, check they are not closed already before processing them
* poll_events is now using nd_poll(), resulting in vast simplification of the code; static web files are now served inline, simplifying the web server logic further (required because epoll does not support regular files; see the sketch after this list)
* startup fixes
* added debug info to poll_fd_close()
* closed sockets are automatically removed from epoll() by the kernel
* fix for mrg acquired and referenced going negative
* fixed bug in mrg cleanup that did not delete metrics without retention
* monitor strings index size
* strings memory chart is now stacked
* replication: do not lock data collection when running in batches
* explicitly set socket flags for sender and receiver
* normalize the event loop for sending data (receiver and sender)
* normalize the event loop for receiving data (receiver and sender)
* check all sender nodes every half second
* fix bug in sender that did not enable sending
* first clean up, then destroy
* normalize nd_poll() to handle all possible events
* cleanup
* normalize socket helper functions
* fixed warnings on alpine
* fix for missing POLLRDHUP
* fix cleanup on shutdown
* added detailed replication summary
* moved logs to INFO
* prevent crash when sender is not there
* madvise_dontfork() should not be used with aral; madvise_dontdump() is only used for file-backed maps
* fix wording
* fix log wording
* split replication receiver and sender; add logs to find missing replication requests
* fix compilation
* fixed bug in backfilling that left garbage in counters - used malloc instead of calloc
* backfilling logs if it misses callbacks
* log replication rcv and replication snd in node info
* remove contention from aral_page_free_lock() by having 2 free lists per page, one for incoming and another for available items, moving incoming to available when available is empty - this allows aral_mallocz() and aral_freez() to operate concurrently on the same page (see the sketch after this list)
* fix internal checks
* log errors for all replication receiver exceptions
* removed wrong error log
* prevent health crashing
* clean up logs that are irrelevant to the missing replication events
* replication tracking: added replication tracking to figure out how replication requests go missing
* fix compilation and fix bug on spawn server cleanup calling uv_shutdown at exit
* merged receiver initialization
* prevent compilation warnings
* fix race condition in nd_poll() returning events for deleted fds
* for user queries, prepare as many parallel queries as half the number of processors
* fix log
* add option dont_dump to netdata_mmap and aral_create
* add logging missing receiver and sender charts
* reviewed judy memory accounting; abstracted flags handling to ensure they all work the same way; introduced atomic_flags_set_and_clear() to set and clear atomic flags with a single atomic operation (see the sketch after this list)
* improvement(go.d/nats): add server_id label (#19280)
* Regenerate integrations docs (#19281)
Co-authored-by: ilyam8 <22274335+ilyam8@users.noreply.github.com>
* [ci skip] Update changelog and version for nightly build: v2.1.0-30-nightly.
* docs: improve on-prem troubleshooting readability (#19279)
* docs: improve on-prem troubleshooting readability
* Apply suggestions from code review
---------
Co-authored-by: Fotis Voutsas <fotis@netdata.cloud>
* improvement(go.d/nats): add leafz metrics (#19282)
* Regenerate integrations docs (#19283)
Co-authored-by: ilyam8 <22274335+ilyam8@users.noreply.github.com>
* [ci skip] Update changelog and version for nightly build: v2.1.0-34-nightly.
* fix go.d/nats tests (#19284)
* improvement(go.d/nats): add basic jetstream metrics (#19285)
* Regenerate integrations docs (#19286)
Co-authored-by: ilyam8 <22274335+ilyam8@users.noreply.github.com>
* [ci skip] Update changelog and version for nightly build: v2.1.0-38-nightly.
* bump dag req jinja version (#19287)
* more strict control on replication counters
* do not flush the log files - to cope with the rate
* [ci skip] Update changelog and version for nightly build: v2.1.0-40-nightly.
* fix aral on windows
* add waiting queue to sender commit, to allow the streaming thread to go fast and put replication threads in order
* use the receiver tid
* fix(netdata-updater.sh): remove commit_check_file directory (#19288)
* receiver now has periodic checks too (like the senders have)
* fixed logs
* replication periodic checks: resending of chart definitions
* strict checking on rrdhost state id
* replication periodic checks: added for receivers
* shorter replication status messages
* do not log about ieee754
* receiver logs replication traffic without RSET
* object state: rrdhost_state_id has become object_state in libnetdata so that it can be reused
* fixed metadata; added journal message id for netdata fatal messages
* replication: undo bypassing the pipeline
* receiver cleanup: free all structures at the end, to ensure there are no crashes while cleaning up
* replication periodic checks: do not run it on receivers, when there is replication in progress
* nd_log: prevent fatal statements from recursing (see the sketch after this list)
* replication tracking: disabled (compile time)
* fix priority and log
* disconnect on stale replication - detected on both sender and receiver
* update our tagline
* when sending data from within opcode handling do not remove the receiver/sender
* improve interactivity of streaming sockets
* log the replication cmd counters on disconnect and reset them on reconnect
* rrdhost object state activate/deactivate should happen in set/clear receiver
* remove writer preference from rw spinlocks
* show the value in health logs
* move counter to the right place to avoid double counting replication commands
* do not run opcodes when running inline
* fix replication log messages
* make IoT harmless for the moment
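
For the spinlock exponential-backoff item above, a minimal sketch of the idea, assuming C11 atomics; the type name and backoff limits are illustrative, not Netdata's actual SPINLOCK implementation:

    #include <stdatomic.h>
    #include <sched.h>

    typedef struct {
        atomic_flag locked;
    } spinlock_sketch_t;   /* illustrative type, not Netdata's SPINLOCK */

    static void spinlock_sketch_lock(spinlock_sketch_t *lock) {
        unsigned backoff = 1;
        while (atomic_flag_test_and_set_explicit(&lock->locked, memory_order_acquire)) {
            for (unsigned i = 0; i < backoff; i++)
                __asm__ volatile("" ::: "memory");  /* compiler barrier: keep the busy-wait */
            if (backoff < 1024)
                backoff <<= 1;      /* exponential backoff, capped (cf. "apply max sleep to spinlocks") */
            else
                sched_yield();      /* past the cap, yield the processor */
        }
    }

    static void spinlock_sketch_unlock(spinlock_sketch_t *lock) {
        atomic_flag_clear_explicit(&lock->locked, memory_order_release);
    }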
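For the nd_poll()/inline static files item: a small standalone demo of the epoll limitation it works around. epoll_ctl() refuses regular files with EPERM, which is why static web files cannot go through the event loop and are served inline instead. The file path is just an example:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    int main(void) {
        int ep = epoll_create1(0);
        int fd = open("/etc/hostname", O_RDONLY);   /* any regular file */

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == -1)
            /* fails with EPERM: regular files do not support epoll */
            printf("epoll_ctl: %s\n", strerror(errno));

        close(fd);
        close(ep);
        return 0;
    }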
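For the aral_page_free_lock() contention item: a sketch of the two-free-lists idea, under the assumption that the allocation path is already serialized per page while frees can arrive from any thread; all names here are hypothetical, not aral's real structures:

    #include <pthread.h>
    #include <stddef.h>

    typedef struct free_item {
        struct free_item *next;
    } free_item_t;

    typedef struct page_sketch {
        pthread_mutex_t incoming_lock; /* guards only the incoming list */
        free_item_t *incoming;         /* frees from any thread land here */
        free_item_t *available;        /* consumed by the allocation path */
    } page_sketch_t;

    /* free path: short critical section on the incoming list only */
    void page_sketch_free(page_sketch_t *p, void *ptr) {
        free_item_t *item = ptr;
        pthread_mutex_lock(&p->incoming_lock);
        item->next = p->incoming;
        p->incoming = item;
        pthread_mutex_unlock(&p->incoming_lock);
    }

    /* alloc path (assumed already serialized per page): drains the private
       available list; only when it runs empty does it take the lock to
       swap in whatever has accumulated on the incoming list */
    void *page_sketch_alloc(page_sketch_t *p) {
        if (!p->available) {
            pthread_mutex_lock(&p->incoming_lock);
            p->available = p->incoming;
            p->incoming = NULL;
            pthread_mutex_unlock(&p->incoming_lock);
        }
        free_item_t *item = p->available;
        if (item)
            p->available = item->next;
        return item;
    }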
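For the atomic_flags_set_and_clear() item: the single-atomic-operation behavior can be sketched as a compare-exchange loop that applies both the set and the clear masks in one step; the signature below is an assumption, not copied from libnetdata:

    #include <stdatomic.h>
    #include <stdint.h>

    static uint32_t atomic_flags_set_and_clear_sketch(_Atomic uint32_t *flags,
                                                      uint32_t set_bits,
                                                      uint32_t clear_bits) {
        uint32_t expected = atomic_load_explicit(flags, memory_order_relaxed);
        uint32_t desired;
        do {
            /* apply both masks in one atomic read-modify-write */
            desired = (expected | set_bits) & ~clear_bits;
        } while (!atomic_compare_exchange_weak_explicit(
                     flags, &expected, desired,
                     memory_order_acq_rel, memory_order_relaxed));
        return expected;  /* previous value, like atomic_fetch_or()/fetch_and() */
    }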
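For the nd_log fatal-recursion item: a minimal sketch of one way to stop fatal() from recursing, using a thread-local guard flag; this is an assumption about the approach, not nd_log's actual code:

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    static _Thread_local bool in_fatal = false;  /* per-thread recursion guard */

    static void log_and_cleanup(const char *msg) {
        /* stands in for the full logging/cleanup path, which could itself
           hit an error condition and call fatal_sketch() again */
        fprintf(stderr, "FATAL: %s\n", msg);
    }

    void fatal_sketch(const char *msg) {
        if (in_fatal) {
            /* a fatal fired while handling a fatal: skip the full path
               and terminate immediately instead of recursing */
            abort();
        }
        in_fatal = true;
        log_and_cleanup(msg);
        abort();
    }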
---------
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
Co-authored-by: Netdata bot <43409846+netdatabot@users.noreply.github.com>
Co-authored-by: ilyam8 <22274335+ilyam8@users.noreply.github.com>
Co-authored-by: netdatabot <bot@netdata.cloud>
Co-authored-by: Fotis Voutsas <fotis@netdata.cloud>